significant accomplishments achieved during the total period of the research effort covered by this
contract. It deals with several topics related to the mathematical properties of feedforward as well as
feedback nets, touching upon the following subareas. For feedforward nets: local minima, uniqueness
of weights, architectures, classification capacity, uses for implementation of state-feedback control
(especially for the control of linear systems with saturation), approximation rates, and (PAC) learning
issues. For recurrent nets: recurrent perceptrons, parameter identification and learning algorithms,
systems-theoretic properties (observability, controllability), and theoretical computational capabilities.

1 Introduction

A huge amount of activity has taken place during the last few years in the area encompassed by the
term “artificial neural networks,” as evidenced by the proliferation of conferences, journals, electronic
discussion groups, and patent applications. This area has seen many periods of hype (and justified
reaction to that hype) since the late 1940s. While there was continuing activity by major researchers
such as Kohonen and Grossberg all along, the latest resurgence in interest was due in large part to
the appearance of two separate lines of work: (1) feedforward nets or “multilayer perceptrons” with
sigmoidal activations,* to be fit to data through nonlinear optimization, and (2) feedback nets, with a
technique for associative memory storage and retrieval due to Hopfield. The impact of this work was
amplified by the coincidental sudden easy availability of huge raw computational power, in amounts
which were unimaginable when similar ideas had been considered in the past.

Main  Motivation  for  This  Work 

Many practical successes of the associated technologies have been claimed in both the engineering and
popular press. In the context of the AFOSR mission, one may mention a 1990 workshop centered
on aerospace applications of neural nets, which took place at McDonnell Douglas Corp. and was
attended by numerous AFOSR contractors (including the first PI) and researchers from several AF
development labs. Presentations focused on the use of network techniques in flight systems design,
such as fault detection and classification (and associated problems of reconfigurable aircraft control),
development of controls valid over large flight envelopes, and even precision laying of composites on
aircraft structures. A recurrent concern raised throughout the discussions during the workshop was the
lack of theoretical foundations for the analysis and comparison of different network tools. The main
[...] study of the capabilities and performance of neural networks. The PIs are mathematicians who are
currently engaged in a program of research whose main purpose is to carry out a rigorous mathematical
analysis for a number of problems in neural nets for which, so far, only heuristic methods have been
developed. They believe that such a development, besides being intrinsically of interest, leads to a better
understanding of the issues involved in the design of efficient algorithms as well as of the
possibilities and limitations of these models.


Another  Motivation 


There is another motivation as well that underlies the work described here, and is more basic than the
understanding of neural net models in themselves. This arises from an issue that has come up in many
[...] concerning the interface between, on the one hand, the continuous, physical world and, on the other
hand, discrete devices such as digital computers, capable of symbolic processing. Lately the term
“hybrid” systems has become popular in this context.


For instance, classical control techniques, especially for linear systems, have proved
successful in automatically regulating relatively simple systems; however, for large-scale problems,
controllers resulting from the application of the well-developed theory are used as building blocks of
more complex systems. The integration of these systems is often accomplished by means of ad-hoc
techniques that combine pattern recognition devices, various types of switching controllers, and humans
(or, more recently, expert systems) in supervisory capabilities. This has caused renewed interest
in the formulation of mathematical models in which the interface between the continuous and the
symbolic is naturally accomplished and system-theoretic questions can be formulated and resolved for
the resulting models. Successful approaches will eventually allow the interplay of modern control theory
with automata theory and other techniques from computer science. This interest has motivated much
research into areas such as discrete-event systems, supervisory control, and more generally “intelligent
control systems”. The present work has as one of its underlying objectives the analysis of neural
networks as a paradigm in which to understand many of these issues. Other paradigms could be used as
well; it turns out that neural nets are a particularly appealing and extremely natural class of nonlinear
systems.

* Contrary to popular misconceptions, multilayer nets (but with discontinuous activations) had already been much studied
in the widely unread but widely cited early 1960s book by Rosenblatt, and feedback nets were studied by him as well.

Similarly, in numerical analysis and optimization, much activity has taken place in the realm of
continuous algorithms. Included here are such areas as differential-equation implementations of “interior
methods” for linear and nonlinear programming, the use of flows on manifolds to solve eigenvalue and
optimization problems (in particular the work of Brockett and his school), and the Blum-Shub-Smale
approach to “real valued” algorithms. All these deal in one way or another with the power of “analog
computing” to solve problems that can also be attacked with discrete/symbolic techniques. Again, we
view the neural net paradigm as one in which to explore the interface between “digital” and “analog”
modes of computation. In fact, recent work by one of the PIs and his students has succeeded in
formulating a new computer-science approach to these issues, as will be described in the report.


Why Neural Nets?

Motivating the use of nets is the belief that in some sense they are an especially appropriate family of
parameterized models, as opposed to, say, finite Fourier series or splines. Typical engineering [...]
numerical and statistical advantages are said to include excellent capabilities for “learning,” “adaptation,”
and “generalization.” We do not argue the case for or against this belief here. Notwithstanding
some popular claims to the contrary (for instance, supported by recent rate of approximation results,
a topic which is reviewed in the report) we feel that the situation is still very unclear regarding the
relative merits of using neural nets. In any case, it is quite likely that techniques based on neural
networks will play some role as part of the general set of tools in estimation, learning, and control. As
mathematicians, we are interested at this stage in obtaining a deeper understanding of the topic, as a
prerequisite to theoretical comparisons.


Scope  and  Organization  of  the  Report 


The term “neural network” tends to be applied loosely, making the area too broad and ill-defined. For the
purposes of this report, however, we adopt the most popular paradigm: (artificial) neural nets are systems
composed of saturation-type nonlinearities, linear elements, and optionally dynamic components
[...] out a variety of topics sometimes included, such as “radial basis” networks or multiplicative effects. (By
narrowing down the area we can get stronger general mathematical results, but no value judgments are
implied regarding the interest of such alternative structures.) It is possible to organize the topic, once it
is so defined, by classifying nets into broad categories, and studying separately different mathematical
questions that are natural for each subclass.

The  topics  to  be  discussed  will  be  organized  roughly  into  these  broad  categories: 


• No Dynamics (“Feedforward Nets”)
  - Local Minima and Uniqueness of Weights
  - Architectures and Classification Capacity
  - Uses for Implementation of State-Feedback Control
    o Single vs Multiple Layers
    o Control of Linear Systems with Saturation
  - Approximation Rates
  - Learning

• Linear Dynamics (“Recurrent Perceptrons”)

• Nonlinear Dynamics (“Fully Recurrent Nets”)
  - Parameter Identification, Observability, Controllability
  - Computational Capabilities

The next Part provides a brief outline of the above areas, with only as many technical details as needed
to convey the main issues and results. The last Part provides some more details on selected subtopics,
but overall most of the discussion in this report is informal, with references to the literature for precise
technical points.

2 Overview

As discussed above, by an artificial neural net we mean here a system which is an interconnection
of basic processors, each of which takes as its inputs the outputs from other processors as well as
from outside the system, and performs a certain nonlinear transformation $\sigma : \mathbb{R} \to \mathbb{R}$ (the “activation
function”) on an affine combination of these inputs. For simplicity, we will assume that each processor
uses the same transformation $\sigma$. The coefficients of the affine combination, or “weights,” of course
vary from processor to processor. The output signal produced by each such basic processing element
is broadcast to the rest of the system; some of the signals are combined into the output of the whole
system. Thus the basic unit for interconnections is as in Figure 1.


Figure 1: Basic Processing Element (inputs, affine combination, output)


The interconnection structure is often called the architecture of the net; together with the choice of
$\sigma$ and the values for weights (which are typically thought of as “programmable” parameters and are the
subject of numerical optimization) it determines completely the behavior of the network; more precise
definitions are given later.

Typical choices for the function $\sigma$ are $\mathrm{sign}(x)$ (zero for $x = 0$) or its relative, the
hardlimiter, threshold, or Heaviside function $\mathcal{H}(x)$, which equals 1 if $x > 0$ and 0 for $x < 0$ (in either
[...]. [...] various numerical techniques, one often needs a differentiable $\sigma$ that somehow approximates $\mathrm{sign}(x)$
[...] sign function when the “gain” $\gamma$ is large in $\tanh(\gamma x)$. Equivalently, up to translations and change of
coordinates, one may use the standard sigmoid

$$\sigma_s(x) = \frac{1}{1 + e^{-x}} .$$

Also common in practice is a piecewise linear function,

$$\pi(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases}$$

sometimes called a “semilinear” or “saturated linearity” function. See Figure 2.


Figure 2: Different Functions $\sigma$
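The activation functions just listed can be written down directly. The following is a minimal sketch, not taken from the report: the function names, the value assigned to the hardlimiter at zero, and the default gain are assumptions made for illustration.

```python
import math

def sign(x):
    """sign(x), taken to be zero at x = 0."""
    return (x > 0) - (x < 0)

def heaviside(x):
    """Hardlimiter: 1 for x > 0, 0 for x < 0 (the value 0 at x = 0 is an assumption)."""
    return 1.0 if x > 0 else 0.0

def tanh_gain(x, gamma=1.0):
    """tanh(gamma * x); approaches sign(x) as the gain gamma grows."""
    return math.tanh(gamma * x)

def sigmoid(x):
    """Standard sigmoid 1 / (1 + exp(-x)), evaluated in an overflow-safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sat(x):
    """Saturated linearity pi(x): -1 below -1, x in between, 1 above 1."""
    return max(-1.0, min(1.0, x))
```

One way to read "up to translations and change of coordinates" is the identity sigmoid(x) = (1 + tanh(x/2)) / 2, which the definitions above satisfy up to rounding.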


Whenever time behavior is of interest, one also includes dynamic elements, namely delay lines if
dealing with discrete-time systems, or integrators in the continuous-time case, so that a well-defined
dynamical system results.

Static nets are those formed by interconnections without loops (otherwise the behavior may not
be well-defined) and are also called feedforward nets, in contrast to the terms feedback, recurrent, or
dynamic nets used in the general case. Figure 3 provides “block diagrams” for the examples of a static
net computing the function

$$y = 2\sigma[3\sigma(5u_1 - u_2) + 2\sigma(u_1 + 2u_2 + 1) + 1] + 5\sigma[-3\sigma(u_1 + 2u_2 + 1) - 1]$$

and a continuous-time dynamic net representing the system

$$\dot{x}_1 = \sigma(2x_1 + x_2 - u_1 + u_2), \quad \dot{x}_2 = \sigma(-x_2 + 3u_2), \quad y = x_1 .$$


Figure  3:  Examples  of  one  Feedforward  and  one  Recurrent  Net 
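To make the distinction between the two examples concrete, the sketch below evaluates a static net as a plain function and steps a dynamic net through time with forward Euler. The coefficients follow our reading of the partly illegible displays in the source and should be treated as assumptions, as should the choice $\sigma = \tanh$, the step size, and the function names.

```python
import math

def static_net(u1, u2, sigma=math.tanh):
    """Feedforward example: a function from R^2 to R, no internal state."""
    shared = sigma(u1 + 2 * u2 + 1)  # hidden unit used by both outer units
    return (2 * sigma(3 * sigma(5 * u1 - u2) + 2 * shared + 1)
            + 5 * sigma(-3 * shared - 1))

def simulate_dynamic_net(u, t_end=1.0, dt=1e-3, sigma=math.tanh):
    """Recurrent example, integrated by forward Euler from x(0) = 0.

    u is a function of time returning the pair (u1, u2);
    the return value is the output y(t_end) = x1(t_end).
    """
    x1 = x2 = 0.0
    for k in range(int(round(t_end / dt))):
        u1, u2 = u(k * dt)
        dx1 = sigma(2 * x1 + x2 - u1 + u2)
        dx2 = sigma(-x2 + 3 * u2)
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    return x1
```

The static net maps a pair of numbers to a number, whereas the dynamic net maps an input signal (a function of time) to an output value that depends on the whole input history.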

Mathematically we think of feedforward nets as computing functions between finite-dimensional
input and output spaces; for instance, in the first example in Figure 3 the net induces a mapping $\mathbb{R}^2 \to \mathbb{R}$.
Dynamic nets, in contrast, are a subclass of systems in the sense of control theory, and must be defined in
terms of differential or difference equations with inputs and outputs. They give rise to mappings between
function spaces. For instance, the second example in Figure 3 could be seen as inducing a mapping

[...]

(assuming a fixed initial state such as $x_1(0) = x_2(0) = 0$ and assuming sufficient regularity, for instance that
$\sigma$ is globally Lipschitz). [...]
and as approximators of functions, and appear for instance when implementing static feedback laws for
certain controlled systems. Dynamic nets provide an appealing class of nonlinear dynamical systems,
and have been used in implementations of classifiers for problems involving correlated temporal data,
as well as universal models for adaptive control. Some of these roles will be discussed further later.

An intermediate case of interest between feedforward nets and the full feedback structure is given
by what the PIs call recurrent perceptrons, by analogy with the perceptrons that have classically been
studied in the area. In this case, the dynamical behavior is linear but there is a nonlinearity at the output.
This special case will be discussed separately.


Local  Minima  and  Weight  Uniqueness  for  Feedforward  Nets 

Typical applications of neural nets are to binary classification, interpolation, and function approximation
problems. For any given fixed architecture (and choice of activation function), the free parameters
[...] approximate a function or [...] a classification [...].

During  a  training  or  supervised  learning  stage,  labeled  examples  are  presented,  and  parameters  are 
adjusted  so  as  to  make  the  network’s  numerical  output  close  to  the  desired  values.  A  steepest-descent 
(or  elaborations  thereof)  minimization  of  an  error  criterion  is  often  used  at  this  point.  Later,  during  actual 
operation,  the  output  given  by  the  net  when  a  new  input  is  presented  will  be  taken  as  the  network’s  guess 
at  the  “right”  value.  Presumably,  one  advantage  of  using  nets  in  this  mode  is  that  the  parallel  processing 
structure  of  nets  means  that  this  guess  could  be  computed  fairly  fast  in  special-purpose  hardware. 
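The training stage described above can be sketched as plain steepest descent on a squared-error criterion. This is an illustration under stated assumptions, not the procedure of any particular cited work: the net has one hidden layer of tanh units with scalar input, and the data set, initial weights, step size, and function names are all hypothetical.

```python
import math

def predict(params, u):
    """Net output c0 + sum_i c[i] * tanh(B[i] * u + b[i]) for scalar input u."""
    c0, c, B, b = params
    return c0 + sum(ci * math.tanh(Bi * u + bi) for ci, Bi, bi in zip(c, B, b))

def loss(params, data):
    """Mean squared error over labeled examples (u, y), with the usual factor 1/2."""
    return sum((predict(params, u) - y) ** 2 for u, y in data) / (2 * len(data))

def gradient_step(params, data, lr):
    """One steepest-descent update of all weights."""
    c0, c, B, b = params
    n, N = len(c), len(data)
    g_c0, g_c, g_B, g_b = 0.0, [0.0] * n, [0.0] * n, [0.0] * n
    for u, y in data:
        err = predict(params, u) - y
        g_c0 += err / N
        for i in range(n):
            h = math.tanh(B[i] * u + b[i])
            dh = 1.0 - h * h  # derivative of tanh at the unit's input
            g_c[i] += err * h / N
            g_B[i] += err * c[i] * dh * u / N
            g_b[i] += err * c[i] * dh / N
    return (c0 - lr * g_c0,
            [c[i] - lr * g_c[i] for i in range(n)],
            [B[i] - lr * g_B[i] for i in range(n)],
            [b[i] - lr * g_b[i] for i in range(n)])

def train(data, params, lr=0.05, steps=2000):
    for _ in range(steps):
        params = gradient_step(params, data, lr)
    return params

# Hypothetical labeled examples and starting weights (c0, c, B, b).
DATA = [(k / 4.0, math.sin(k / 4.0)) for k in range(-8, 9)]
INIT = (0.0, [0.1, -0.1, 0.05], [1.0, -0.5, 0.3], [0.0, 0.2, -0.1])
```

Calling `train(DATA, INIT)` returns adjusted weights; nothing in the procedure guarantees that the point reached is a global minimum of the error.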
One may attempt to justify the use of such models for prediction applications on the basis of
computational learning theory (see below), but most empirical work uses statistical cross-validation
techniques for this purpose. Whatever the justification, a major set of issues to be addressed centers on
the fact that the procedure involves a hard nonlinear minimization problem. Thus, understanding the
structure of the set of local minima of the error function is a must. In early work in this area by the
PIs and others, nonglobal local minima were rigorously shown to exist in even the
simplest possible architectures. For a special case, a recasting in terms of “nonstrict” error functions
results in a problem which does not suffer from spurious local minima, and for which gradient descent
can be shown to converge globally (see [1]), but in the general case the existence of such obstructions
is unavoidable.

Moreover, and related to this, many weights could potentially give rise to the same input/output
behavior, and this multiplicity will obviously affect the optimization algorithm, since then different
weights give the same cost. Hence, one must ask what the group of behavior-preserving weight
transformations looks like. This question was posed in the late 1980s by Hecht-Nielsen. An answer
was provided with a general result characterizing the possible symmetries for the general case of one
hidden layer (1HL) nets (the most widespread architecture) by one of the PIs, in [4]. Such nets are
those which diagrammatically can be represented as:


Figure 4: One Hidden Layer (1HL) Net


Here one deals with functions that can be expressed in the form of an affine combination of terms each
of which consists of the activation $\sigma$ applied to an affine combination of inputs:

$$c_0 + \sum_{i=1}^{n} c_i \, \sigma(B_i u + b_i) \qquad (1)$$

where $B_i$ is the $i$th row of the matrix $B$ of interconnection coefficients from inputs $u$ to the “hidden
layer.” In more compact notation, the function represented is $f(u) = C\vec{\sigma}_n(Bu + b) + c_0$, where
[...] application of $\sigma$ to each coordinate of an $n$-vector:

$$\vec{\sigma}_n(x_1, \ldots, x_n) = (\sigma(x_1), \ldots, \sigma(x_n)) . \qquad (2)$$

(The subscript $n$ will be omitted as long as its value is clear from the context. Also, these notations
are used if there are vector instead of scalar outputs $y$, in which case $C$ is a matrix and $c_0$ a vector.)
In other words, the behavior of a 1HL network is a composition of the type $f \circ \vec{\sigma} \circ g$, where $f$ and $g$ are
affine maps. In the case when $u$ is a scalar ($m = 1$), one is studying the span of dilates and translates of
$\sigma$, somewhat in the spirit of wavelet theory. The interest is in understanding how many possible such
representations there may exist for a given function.
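The one-hidden-layer form can be evaluated in a few lines. The sketch below assumes a scalar output and represents $B$ as a list of rows; the function name and the sample weights are illustrative only.

```python
import math

def one_hidden_layer(u, B, b, C, c0, sigma=math.tanh):
    """Evaluate c0 + sum_i C[i] * sigma(B[i] . u + b[i]) for an input vector u.

    B is a list of n rows (each of length m = len(u)), b and C have length n.
    """
    hidden = [sigma(sum(Bij * uj for Bij, uj in zip(Bi, u)) + bi)
              for Bi, bi in zip(B, b)]
    return c0 + sum(ci * hi for ci, hi in zip(C, hidden))

# Hypothetical weights for a net with m = 2 inputs and n = 2 hidden units.
B_EXAMPLE = [[1.0, 0.0], [0.0, 1.0]]
b_EXAMPLE = [0.0, 0.0]
C_EXAMPLE = [2.0, -1.0]
c0_EXAMPLE = 0.5
```

The two affine maps of the composition are visible in the code: the inner sums compute $Bu + b$, and the final line applies $C(\cdot) + c_0$ to the vector of activations.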
The result in [4] established the finiteness of the group of symmetries under obvious generic
assumptions (for instance, if some $c_i$ vanishes, the corresponding $B_i$ cannot be unique) for the special
case of the activation tanh. The result is that invariance only holds under the obvious transformations:
permutations and sign reversals in each term. The proof is based on asymptotic analysis techniques.
In work of the other PI and coworkers ([17]) the result was extended to a wide class of real-analytic
functions, using residue arguments. (The issue of weight symmetries has also been studied for recurrent
nets, by the PIs and students, as mentioned later.) Together with results guaranteeing that a small number
of samples is enough in order to determine the external behavior, as shown in the recent paper [33] as
a consequence of results on stratification of subanalytic sets, this means that the map that evaluates the
function at sufficiently many samples has generically discrete fibers. Together with recent results in
logic that apply to tanh (again see [33]), this provides results on generic structure of local minima sets
for the error function.
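The two "obvious" transformations are easy to check numerically for tanh, since tanh is odd. The sketch below only verifies that they leave the input/output behavior unchanged (the easy direction); it says nothing about the harder statement that no other symmetries exist. The scalar-input net and all names are assumptions for illustration.

```python
import math

def net(u, B, b, C, c0):
    """Scalar-input 1HL tanh net: c0 + sum_i C[i] * tanh(B[i] * u + b[i])."""
    return c0 + sum(ci * math.tanh(Bi * u + bi) for ci, Bi, bi in zip(C, B, b))

def permute(B, b, C, order):
    """Reorder the hidden units; the sum in the net is unaffected."""
    return [B[i] for i in order], [b[i] for i in order], [C[i] for i in order]

def flip_sign(B, b, C, i):
    """Negate B[i], b[i] and C[i] together; tanh(-z) = -tanh(z) cancels the change."""
    B, b, C = list(B), list(b), list(C)
    B[i], b[i], C[i] = -B[i], -b[i], -C[i]
    return B, b, C
```

The genericity assumption mentioned above also shows up here: if some `C[i]` is zero, `B[i]` and `b[i]` can be changed arbitrarily without affecting `net`, so the weights cannot be unique.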


Architectures  and  Classification  Capacity 


It is essential to understand how many units are needed in order to solve a given classification or
interpolation task. One of the PIs has worked substantially on such issues (see e.g. [3]), and this will be
described in the report.

A closely connected question concerns the study of the relative advantages of different architectures.
In particular, it is by now well-known, thanks to the pioneering work of Cybenko, Hornik, White,
Leshno, and others, that 1HL networks have universal approximation properties, assuming very little
on the activation $\sigma$ besides it not being polynomial (for simple proofs, which apply for restricted but
yet quite general activations, see [18]). These results are stated in terms of density properties in spaces
[...].

However, the quality of approximation with such 1HL architectures may not be as good, for a fixed
number of [...]. For
instance, the characteristic function of a square in the plane can be easily approximated by a two hidden
layer (2HL) net, but good approximations using 1HL nets require many terms. More generally, one
[...] make statements for classes [...] well-approximated by [...] splines [...]. [...] related
disadvantage of 1HL vis-à-vis 2HL nets arises from the topologies in which the 1HL approximation
theorems hold. One of the PIs, in [2], gave a general theorem that deals with this issue. A basic
problem is that uniform approximation of discontinuous functions is in general impossible using 1HL
nets; technically, there is no density in $L^\infty$, even on compact subsets, and an even stronger negative
result holds, regarding the impossibility of constructing sections of certain coverings. This has serious
[...] has been noted experimentally by many authors. The same problem appears in control applications, as
explained later.
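The square example can be made concrete with hardlimiter activations, in which case a two-hidden-layer net reproduces the characteristic function exactly rather than approximately. This is a sketch under assumptions: the square is taken to be the open unit square, the hardlimiter is set to 0 at the origin, and the threshold 3.5 is one convenient choice.

```python
def heaviside(x):
    """Hardlimiter: 1 for x > 0, else 0 (value at 0 is an assumption)."""
    return 1.0 if x > 0 else 0.0

def square_indicator_2hl(u1, u2):
    """Characteristic function of the open unit square via two hidden layers."""
    # First hidden layer: one half-plane detector per side of the square.
    h = [heaviside(u1), heaviside(1.0 - u1), heaviside(u2), heaviside(1.0 - u2)]
    # Second hidden layer: a single unit that fires only if all four detectors fire.
    g = heaviside(sum(h) - 3.5)
    # Output layer: affine combination of second-layer units (here the identity).
    return g
```

A single hidden layer only has access to an affine combination of the four half-plane detectors, which takes values between 0 and 4 rather than being an indicator; the second layer supplies the thresholding that turns it into one.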


Implementations of State-Feedback Control; Saturated Actuators


Among the most popular areas of application for neural networks is that of control (see for instance the
survey [18]). This part of the report touches upon various aspects of that topic.

One line of recent (but already widely cited) work of the PIs deals with the use of networks for the
control of linear systems subject to actuator saturation:

$$\dot{x} = Ax + B\sigma(u)$$

where $A$ and $B$ are as usual in linear control theory and $\sigma$ is a saturation such as $\tanh$ or $\pi$. It is
often said that saturation is the most commonly encountered nonlinearity in control engineering, so the
development of techniques for the control of such systems is obviously of great interest. Our work starts
with the result by Fuller, around 1970, that it is in general impossible to globally stabilize the origin of
such systems by means of linear feedback $u = Fx$ (for a more general result along those lines, see [23]),
even if the system is open-loop globally controllable to the origin. This suggests the obvious question
of searching for nonlinear feedback laws $u = k(x)$ that achieve such stabilization. The interest here is in
nicely behaved and easily implementable controllers, in contrast to optimal control techniques, which
result in highly irregular feedback.

In 1990 work the PIs proved that smooth stabilization is always possible. Motivated by our paper,
Teel showed that single-input multiple integrators can be stabilized by feedbacks which are themselves
compositions of linear functions and iterated saturations. We were then able to extend Teel’s result
to arbitrary linear systems as above; see [26]. This is very satisfying mathematically: the control of
a linear system involving saturation is achieved by a feedback law built (in a fairly simple manner)
out of similar components. Pursuing this research further, we have now arrived at the rather surprising
conclusion that one may always use the simplest possible nonlinear architecture, namely 1HL nets. (See
[39] for a summary; the full paper will appear in [11].) As an application, the paper [34] describes an
explicit example of control design for an F-8 aircraft subject to elevator rate constraints. Much needs to
be done in this area, as our results provide unacceptable performance, but the improvement with respect
to linear feedback is remarkable.
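A feedback built from linear functions and iterated saturations can be illustrated on the simplest multiple integrator, the double integrator $\dot{x}_1 = x_2$, $\dot{x}_2 = \sigma(u)$. The sketch below is in the spirit of the nested-saturation designs mentioned above, but the specific law, the inner saturation level 0.4, the initial state, and the forward-Euler step are assumptions chosen for illustration, not the construction of the cited papers.

```python
def sat(v, level=1.0):
    """Saturation at +/- level (level = 1 gives the function pi)."""
    return max(-level, min(level, v))

def nested_saturation_feedback(x1, x2):
    """u = -sat(x2 + sat_0.4(x1 + x2)): linear maps composed with two saturations."""
    return -sat(x2 + sat(x1 + x2, 0.4))

def simulate(x1, x2, t_end=100.0, dt=0.01):
    """Forward-Euler simulation of x1' = x2, x2' = sat(u) under the feedback above."""
    for _ in range(int(round(t_end / dt))):
        u = nested_saturation_feedback(x1, x2)
        dx1, dx2 = x2, sat(u)  # the actuator itself saturates
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    return x1, x2
```

Starting from `simulate(10.0, 5.0)`, the velocity is first driven down at the maximal rate, the combination `x1 + x2` then decreases at the constant rate set by the inner saturation level, and finally both saturations become inactive and the closed loop is linear and stable.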


Thus, 1HL nets are found to be useful in controlling certain nonlinear systems, which is consistent
with the designs proposed in much of the neurocontrol literature. The availability of a simple structure
[...] universality results for function approximation would seem to indicate that such
[...] “universal” enough for the implementation of controllers, in a precise sense reviewed later. This was
first pointed out in the paper [2], which
[...] discontinuous activations) are sufficient. (The contribution was recognized with an honorable mention
for outstanding paper in the IEEE Transactions on Neural Networks for 1992.)

[...] is that control problems are essentially inverse
problems (one must solve for a trajectory satisfying certain boundary conditions). For such, as pointed
out above, more general architectures are needed. To give an outline of how this obstruction can occur,
consider the following situation. Suppose that the objective is to globally asymptotically stabilize a
planar system

$$\dot{x} = f(x, u)$$

with respect to the origin using controllers $u = k(x)$ implementable by 1HL nets. It is easy to give
examples for which there is no continuous feedback law that will achieve stabilization, even if the
original system is very simple. So no 1HL net with continuous activations would be able to accomplish
the stated objective. But what about allowing discontinuous $\sigma$, such as Heaviside activations? (We
ignore for now questions of possible nonexistence of solutions for ODEs with discontinuous right-hand
sides; the problem will be even more basic.) It turns out that even then it may be impossible to stabilize.
Indeed, assume that we know some discontinuous feedback law $k_0(x)$ which stabilizes. It would appear
that one can then obtain $k(x)$ simply by approximating $k_0$. However, as we said in an earlier section, it
is impossible in general to approximate a discontinuous $k_0$ uniformly by 1HL functions (this is an easy
result pointed out in [2]). A weak type of approximation may not be enough for control purposes. For
instance, it may be the case that for each approximant $k$ there is some smooth simple closed curve $\Gamma$
encircling the origin where the approximation is bad and that this causes the vector field $f(x, k(x))$
to point transversally outward everywhere on $\Gamma$; in that case trajectories starting outside $\Gamma$ cannot cross it. Thus a bad
approximation in a very small set (even of measure zero) can introduce a “barrier” to global stabilization.

The paper [2], for discrete-time, constructs examples of systems which are otherwise stabilizable
but such that every possible feedback implementable by a 1HL net (with basically any type of activation,
continuous or not) must give rise to a nontrivial periodic orbit. On the other hand, it can be shown that
every system that is stabilizable, by whatever $k$, can also be stabilized using 2HL nets with discontinuous
activations (under mild technical conditions, and using sampled control). See [2] for details.

To summarize, if stabilization requires discontinuities in feedback laws, it may be the case that no
possible 1HL net stabilizes. Thus the issue of stabilization by nets is closely related to the standard
problem of continuous and smooth stabilization of nonlinear systems, one that has attracted much
research attention in recent years. Roughly, there is a hierarchy of state-feedback stabilization problems:
those that admit continuous solutions, those that don’t but can still be solved using 1HL nets with
discontinuous activations, and more general ones (solvable with 2HL). It can be expected that an
analogous situation will be true for other control problems. Perhaps the reason that most neurocontrol
papers have reported success while using 1HL nets is that they almost always dealt with feedback
linearizable systems, which form an extremely restricted class of systems that happen to admit continuous
stabilizers.


Approximation  Rates 


Certain  recent  results  due  to  Andrew  Barron  and  Lee  Jone.s  have  been  used  to  support  the  claim  that 
( 1  HD  neural  network  approximations  may  require  less  parameters  than  conventional  techniques.  What 
LS  oj  Li'.ii  i:>  tfii'ii  iipproMi'iiaUDn^  Oi  tur.oCion:)  ifi  (Gcnncd  LV'piCtihy  ifi  hcin'nonic 

analvsis  terms  or  bv  bounds  in  suuable  Soboiev  norms)  to  within  a  desired  error  tolerance  can  be  obtained 


splines,  or  Fourier  series,  would  require  an  astronomical  number  of  terms,  especially  for  multivariate 

and  Jones.  Their  results  apply  in  principle  also  to  give  efficient  approxim.ations  with  various  types  of 
classical  basis  functions,  as  long  as  the  basis  elements  can  be  chosen  in  a  nonlinear  fashion,  }usi  as  with 
neural  netv.or.'\.s.  For  in-'tance,  splines  wnh  'va.i_>in^  (rutiiC»  muii  fi.cci-iy  uvues,  Lii^uiionicLn^  Scues 
with  adaptively  selected  frequencies,  will  have  the  same  properties.  What  is  important  is  the  possibility 
of  selecting  terms  adaptively,  in  contrast  to  the  use  of  a  large  basis  containing  many  terms  and  fitting 
these  through  the  use  of  least  squares.  Thus,  the  important  fact  about  the  recent  results  is  that  they 
emphasize that nonlinear parameterizations may require fewer parameters than linear ones to achieve a
guaranteed degree of approximation. (Abstractly, this is not so surprising: for an analogy, consider
[...] well every element in a Euclidean space but no (d - 1)-dimensional subspace can do
so.) For an exposition and a precise theorem, comparing rates of approximation by such neural net and
nonlinear adaptation approaches vs. rates obtainable with classical approximation techniques, see the
[...] result as well). In robust estimation, the advantages of norms different from quadratic are
well-known. This motivated the work [37], [14], which established good rates of approximation when
measuring  errors  in  several  norms  as  well  as  limitations  of  the  “greedy”  or  “incremental”  technique 
suggested  by  Jones  in  his  study  of  neural  nets  and  projection-pursuit  estimation.  The  work  in  [37],  [14] 
is  based  on  ideas  from  the  theory  of  stochastic  processes  on  function  spaces  and  techniques  related  to 
moduli  of  smoothness  in  Banach  spaces. 
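
The "greedy" or "incremental" idea can be illustrated with a minimal sketch in a finite-dimensional setting. Everything specific here is an assumption made for illustration and is not taken from the report: functions are replaced by vectors with the Euclidean norm, the dictionary is a finite list, each step forms a convex combination (1 - a) f + a g, and the step size a is searched over a small grid. The names `greedy_approx` and `sq_dist` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_approx(target, dictionary, steps, grid=50):
    """Incremental scheme: f_k = (1 - a) * f_{k-1} + a * g, where g ranges over the
    dictionary and a over a grid in [0, 1]; the pair giving the smallest squared
    error is kept. Since a = 0 is allowed, the error never increases."""
    current = [0.0] * len(target)
    errors = [sq_dist(current, target)]
    for _ in range(steps):
        best, best_err = current, errors[-1]
        for g in dictionary:
            for j in range(grid + 1):
                a = j / grid
                cand = [(1 - a) * c + a * gi for c, gi in zip(current, g)]
                err = sq_dist(cand, target)
                if err < best_err:
                    best, best_err = cand, err
        current = best
        errors.append(best_err)
    return current, errors

# Toy data: the target lies in the convex hull of the dictionary elements.
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = [0.5, 0.3, 0.2]
approx, errors = greedy_approx(target, dictionary, steps=20)
```

Only one new term is chosen per step, adaptively, which is the feature the discussion above contrasts with fitting a large fixed basis by least squares.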
Learning


The  use  of  networks  for  pattern  classification  and  related  applications  makes  imperative  the  theoretical 
study  of  the  question  of  learning.  One  of  the  main  current  approaches  to  defining  and  understanding 
the  meaning  of  “learning”  is  based  on  the  probably  approximately  correct  (“PAC”)  model  proposed  in 
computational  learning  theory  by  Valiant  in  the  early  1980s.  Very  closely  related  ideas  appeared  in 
statistics even earlier, in the work of Vapnik and Chervonenkis, and the interactions between statistics
and  computer  science  are  the  subject  of  much  current  research;  for  a  quick  survey  of  some  basic  results, 
see  for  instance  [18]. 


In  the  PAC  paradigm,  a  “learner”  has  access  to  data  given  by  a  labeled  sample  generated  at  random, 
with  inputs  independently  and  identically  distributed  according  to  some  fixed  but  unknown  probability 
measure.  It  is  assumed  that  there  is  some  fixed  but  unknown  function  generating  the  data,  and  this 
function  belongs  to  some  known  class  of  functions  which  is  used  to  characterize  the  assumptions 
(“bias”)  being  made  about  what  is  common  among  the  observed  input/output  pairs.  The  learner  knows 
the  class  but  not  the  particular  function  generating  the  data.  (For  instance,  in  linear  dynamical  systems 
identification,  the  class  could  be  that  of  all  stable  SISO  systems  of  a  certain  McMillan  degree.)  The 
learner’s  objective  is  to  use  the  information  gathered  from  the  observed  labeled  samples  in  order  to 
guess  the  correct  function  in  this  class,  or,  more  precisely,  to  make  a  good  guess  as  to  the  correct 
output for unseen inputs. (This can all be formulated elegantly in terms of $L^1$ norms. Also, note that
many variations are possible, for instance allowing a “hypothesis” which is in a different class, allowing
[...] the nature of the distribution of the
samples, and so forth.) Learnability can then be defined in an information-theoretic sense, in terms
of the requirement that a relatively small number of samples be generally sufficient to obtain a good
approximation of the unknown function, or in a complexity-theoretic sense, requiring that in addition
[...]


For binary functions (that is, classification problems) and the information-theoretic question, the
situation can be well-characterized in terms of the Vapnik-Chervonenkis (VC) dimension of the class
of functions. Finiteness of this measure is necessary and sufficient for PAC learnability. For networks
with discontinuous -more precisely Heaviside- activations, finiteness of VC dimension has been known
for a long time, based on the work of Haussler and Baum. However, this did not apply to nets as
implemented in practice, as differentiability is required for gradient descent algorithms. Recently, in
the joint work [33], we solved the long-standing open question of showing finiteness of VC dimension,
and hence learnability in the information-theoretic sense, for the more realistic case of sigmoidal nets
using activations tanh and other standard activations. The results are based on very recent work in logic
and show also the finiteness of VC dimension for multivariate sparse polynomials, which had been an
open  question  in  theoretical  computer  science. 
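
As a small illustration of what the VC dimension measures, the following sketch checks by brute force which point sets on the real line are "shattered" by a single Heaviside unit u -> H(wu + b). This is only a toy case chosen for illustration and is not the construction used in [33]; the function names are hypothetical, and the points are assumed to be distinct.

```python
def dichotomies(points):
    """All 0/1 labelings of `points` realizable by u -> H(w*u + b) on the real line."""
    pts = sorted(points)
    # Candidate thresholds: one below all points, one between each adjacent pair,
    # and one above all points. Any other threshold induces one of these labelings.
    cuts = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    realized = set()
    for c in cuts:
        realized.add(tuple(1 if u > c else 0 for u in points))  # w > 0
        realized.add(tuple(1 if u < c else 0 for u in points))  # w < 0
    return realized

def is_shattered(points):
    """True if every one of the 2^k labelings of the k points is realizable."""
    return len(dichotomies(points)) == 2 ** len(points)
```

With this class, any two distinct points are shattered but no three points are, so the VC dimension is 2; for three points the labeling that marks only the middle point (or only the two outer ones) is never realized.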


Regarding complexity-theoretic aspects of learning, it was known since work of Blum and Rivest
several years ago that with Heaviside activations the problem is not learnable (this amounts to showing
NP-hardness of the “loading problem”) but for continuous activations, and in particular again for the
activations used in practice such as tanh, the question is still open. (There have been results announced
in late 1993 applying to binary inputs, but for real-valued inputs the question is not settled.) Recent
research  (see  [13])  has  made  partial  progress  towards  that  goal. 


Linear  Dynamics  (“Recurrent  Perceptrons”) 

We  now  turn  to  recurrent  nets,  that  is,  the  case  in  which  the  interconnection  graph  has  loops  and  so  a 
dynamic  interpretation  of  behavior  is  necessary.  A  degenerate  case  is  that  in  which  no  loop  contains  a 
nonlinear element $\sigma$. Such systems are described by cascades of a linear system -in the standard sense
in control theory- with a feedforward net. The simplest case is given by a difference or a differential
equation

$$x(t+1) \ [\text{or } \dot{x}(t)] = Ax(t) + Bu(t), \qquad y(t) = \sigma_p(Cx(t)),$$

in discrete- or continuous-time respectively (dot indicates time derivative), where A is an n × n matrix,
B is n × m, and C is p × n. Here $\sigma_p$ is a map which in each coordinate is a saturation-type nonlinearity,
such as sign, $\pi$, or $\sigma_s$. We call such systems, for which the dynamics are linear but the outputs are
subject  to  the  limitations  of  measuring  devices,  constrained-output  (linear)  systems. 
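
A minimal discrete-time simulation of such a constrained-output system, with the sign nonlinearity at the output, might look as follows. The helper names and the example matrices are hypothetical, and the convention sign(0) = 0 is an assumption (the text does not fix it).

```python
def sign(v):
    """Sign of a number as -1, 0 or +1 (sign(0) = 0 is an assumed convention)."""
    return (v > 0) - (v < 0)

def mat_vec(M, v):
    """Matrix-vector product for matrices stored as lists of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def simulate_sl(A, B, C, inputs, x0):
    """x(t+1) = A x(t) + B u(t), y(t) = sign(C x(t)); returns y(0), ..., y(T-1)."""
    x, outputs = list(x0), []
    for u in inputs:
        outputs.append([sign(c) for c in mat_vec(C, x)])
        x = [a + b for a, b in zip(mat_vec(A, x), mat_vec(B, u))]
    return outputs

# Example: a rotation by 90 degrees, observed only through the sign of the first coordinate.
A = [[0, 1], [-1, 0]]
B = [[0], [0]]
C = [[1, 0]]
ys = simulate_sl(A, B, C, inputs=[[0]] * 4, x0=[2, 1])
```

The observer sees only one bit per step, which is the setting in which the state-observability questions below are posed.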

There  are  many  reasons  for  studying  such  objects  besides  neural  networks.  As  mentioned  earlier, 
there  is  a  generally  recognized  interest  in  understanding  the  continuous/discrete  interface;  one  natural 
first  step  is  the  study  of  partial  (discrete)  measurements  on  the  state  of  a  continuous  dynamical  system, 
in  the  style  of  symbolic  dynamics.  One  of  the  first  questions  that  one  may  address  concerns  the  nature  of 
the  information  that  can  be  deduced  by  a  symbolic  “supervisor”  from  data  gathered  from  such  a  “lower 
level”  continuous  device,  using  appropriate  controls  in  order  to  obtain  more  information  about  the 
system.  Most  sampling  involves  some  form  of  quantization  into  a  discrete  range;  a  special  case  of  this, 
1-bit quantization, is that in which $\sigma$ simply takes the sign of each coordinate, giving rise to sign-linear
(SL) systems. These were the focus of the paper [6] by one of the PIs and a graduate student, which
provided  a  complete  necessary  and  sufficient  characterization  for  the  state-observability  of  SL  systems, 
expressed  in  terms  of  algebraic  rank  conditions  similar  to  the  linear  case.  The  results  in  that  paper  and 
related  work  by  the  same  authors  parallel  those  in  the  linear  case,  but  with  some,  perhaps  unexpected, 
differences. For instance, the characterization is different in the continuous- and discrete-time cases,
and controllability properties affect observability. In the more recent work [12], the same questions
were asked for the case of output-saturated systems, those for which in each coordinate $\sigma = \pi$, that
is, the measurement device is saturated for large values but is linear near zero. In this case, an elegant
[...]

Sign-linear systems are also motivated by pattern recognition applications. Indeed, among the
most popular techniques for that purpose are those based upon perceptrons or linear discriminants.
Mathematically, these are simply functions of the type $\phi : R^k \to R^p$, $\phi(v) = \mathrm{sign}(Cv)$ (sign is taken in
each coordinate), typically with k > p. Perceptrons are used to classify input patterns $v = (v_1, \ldots, v_k)$
into classes, and they form the basis of many statistical techniques. In many practical situations, arising
in speech processing or learning finite automata and languages, the vector v really represents a finite
window

$$u(t-1), \ \ldots, \ u(t-s) \tag{3}$$

of a sequence of m-dimensional inputs u(1), u(2), ..., where the components of (3) have been listed as
v (and sm = k). In that case, the perceptron can be understood as a sign-linear system of dimension n,
with a shift-register used to store the previous inputs (3). Seen in our context, perceptrons are nothing
more than the very special subclass of “finite impulse response” SL systems (all poles at zero). As such,
they are not suited to modeling time dependencies and recurrences in the data. It is more reasonable to
generalize to the case where a convolution by an arbitrary realizable kernel is taken before the sign, and


this  motivated  the  introduction  of  SL  systems  in  the  engineering  and  neural  nets  literature.  (Models  of 
this form also arise in many other areas as well. For instance, in signal processing, when modeling linear
channels transmitting digital data from a quantized source, the channel equalization problem becomes
one of systems inversion for systems that are essentially of this type, with quantizer $\sigma$.) We called the
resulting  input/output  behaviors  sign-linear  i/o  maps.  For  these,  sign-linear  systems  constitute  the  most 
obvious  class  of  state-space  realizations.  Thus  one  is  interested  in  questions  such  as  minimal  dimension 
representations  and  the  existence  of  Nerode-canonical  (i.e.,  reachable  and  observable)  realizations. 
The paper [6] treated such issues, and in particular showed that canonical realizations admit a natural
structure  as  cascades  of  linear  systems  and  finite  automata. 
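
The remark that a perceptron acting on a finite window is a "finite impulse response" SL system can be checked on a small example. The sketch below assumes scalar inputs (m = 1), a single output (p = 1), zero inputs before time 0, and the convention sign(0) = 0; all names are hypothetical.

```python
def sign(v):
    """Sign of a number as -1, 0 or +1 (sign(0) = 0 is an assumed convention)."""
    return (v > 0) - (v < 0)

def windowed_perceptron(c, u):
    """y(t) = sign(c[0]*u(t-1) + ... + c[s-1]*u(t-s)), with u(t) = 0 for t < 0."""
    s = len(c)

    def past(t):
        return u[t] if t >= 0 else 0

    return [sign(sum(c[j] * past(t - 1 - j) for j in range(s))) for t in range(len(u) + 1)]

def shift_register_system(c, u):
    """The same map realized as x(t+1) = A x(t) + B u(t), y(t) = sign(C x(t)),
    with A the nilpotent shift matrix (all eigenvalues zero), B = e_1 and C = c."""
    s = len(c)
    A = [[1 if j == i - 1 else 0 for j in range(s)] for i in range(s)]
    B = [1] + [0] * (s - 1)
    x, ys = [0] * s, []
    for t in range(len(u) + 1):
        ys.append(sign(sum(ci * xi for ci, xi in zip(c, x))))
        if t < len(u):
            x = [sum(A[i][j] * x[j] for j in range(s)) + B[i] * u[t] for i in range(s)]
    return ys

weights = [2, -1, 1]
signal = [1, -1, 3, 0, -2]
```

Replacing the shift matrix by a general matrix A gives the SL systems discussed here, in which the kernel convolved with the input is no longer of finite length.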

Recurrent  Nets:  Models 

The general case of recurrent nets considered in this report consists of those systems evolving in $R^n$
according to equations of the type

$$x(t+1) \ [\text{or } \dot{x}(t)] = \sigma_n(Ax(t) + Bu(t)), \qquad y(t) = Cx(t). \tag{4}$$

Any such system is specified by providing a triple of matrices A, B, C and the activation function $\sigma$.
We use notations consistent with linear systems theory (when $\sigma$ is the identity), that is, A, B, and C are
respectively real matrices of sizes n × n, n × m and p × n. See the block diagram in Figure 5, where
$\Delta x = x^+$ or $= \dot{x}$ in discrete or continuous time respectively:

[Figure 5: block diagram of the recurrent net (4).]

[...]

$$\dot{x} = -Dx + \sigma(Ax + Bu)$$

with D a diagonal matrix (for “Hopfield nets” one picks in addition A symmetric). Also, the input term
Bu may be outside the nonlinearity. For purposes of this report, we restrict attention to the form (4),
but [7] shows how to transform among different models.


Recurrent nets have been studied for a long time, and appeared early on in the work of Hopfield
as well as among the models considered by Grossberg and his school. They are sometimes interpreted
as representing the evolution of ensembles of n “neurons,” where each coordinate $x_i$ is a real-valued
variable which represents the internal state of the i-th neuron, and each $u_i$, $i = 1, \ldots, m$, is an external
input signal. The coordinates of y(t) represent the output of p probes, or measurement devices, each of
[...] on some coordinates, that is, the components of y are simply a subset of the components of x.)


With  feedback,  one  may  exploit  context-sensitivity  and  memory,  characteristics  essential  in  the 
modeling  and  control  of  processes  involving  dynamical  elements.  Recurrent  networks  have  been 

employed in the design of control laws for robotic manipulators (Jordan), as well as in speech recognition
[...] extrapolation for time series prediction (Farmer). In signal processing and control, recurrent nets have
been proposed as generic identification models or as prototype dynamic controllers. In addition, as
discussed later, recent theoretical results about neural networks established their universality as models
for  systems  approximation  as  well  as  analog  computing  devices. 


Electrical  circuit  implementations  of  recurrent  nets,  employing  resistively  connected  networks  of 
n identical nonlinear amplifiers, with the resistor characteristics used to reflect the desired weights,
have been suggested as analog computers, in particular for solving constrained optimization problems
and  for  implementing  content-addressable  memories.  Special  purpose  analog  chips  are  being  built  to 
implement  recurrent  nets  directly  in  hardware. 

However,  and  in  spite  of  their  attractive  features,  recurrent  networks  have  not  yet  attained  as  much 
popularity  as  might  have  been  expected,  compared  with  the  feedforward  nets  so  ubiquitous  in  current 
applications. Perhaps the main reason for this is the fact that training (“learning”) algorithms for recurrent
nets suffer from serious potential limitations. The learning problem is that of finding parameters that
“fit” a general form to training (experimental) data, with the goal of obtaining a model which can
subsequently  be  used  for  pattern  recognition  and  classification,  for  implementation  of  controllers,  or 
for  extrapolation  of  numerical  values.  Various  learning  methodologies  for  recurrent  networks  have 
been  proposed  in  the  literature,  and  have  been  used  in  applications.  All  algorithms  are  based  on  an 
attempt  to  achieve  the  optimization  of  a  penalty  criterion  by  means  of  steepest  descent,  but  due  to 
memory  and  speed  constraints,  they  usually  only  involve  an  estimate  of  the  gradient.  They  differ  on 
the  approximation  used,  and  thus  on  their  memory  requirements  and  convergence  behavior.  Among 
these  are  “recurrent  backpropagation,”  “backpropagation  through  time,”  and  the  “real  time  recurrent 
learning”  algorithm.  Besides  an  imperfect  approximation  of  the  gradient,  there  is  of  course  the  issue 
of  spurious  local  minima,  as  in  the  feedforward  case.  Part  of  this  report  deals  with  a  new  and  totally 
different  approach  -based  on  recent  theoretical  results-  for  the  identification  of  recurrent  models.  This 
brings  us  to  the  next  point. 


Parameter  Identification.  Observability.  Controllability 


As for feedforward nets, one can ask about uniqueness. In [7] (for continuous-time) and
[...], one of the PIs and a graduate student studied to what extent the function of the
[...], that is, the entries of the matrices A, B, and C. A more precise formulation is as follows. Assume that the
network is started at the relaxed state x(0) = 0 (nonequilibrium initial states are the subject of a paper
in preparation) and an input signal u(·) is applied. Let x(·) be the solution of (4) -in continuous-time,
assume that $\sigma$ is globally Lipschitz so x(t) is well-defined for all times- and let y(t) = Cx(t) be the
output signal thus generated. In this manner, for each triple (A, B, C) one defines an input-output
mapping $\lambda_{(A,B,C)} : u(\cdot) \mapsto y(\cdot)$. The formal question is to what extent the matrices A, B, C are
determined by the i/o mapping $\lambda_{(A,B,C)}$.


In the very special case when $\sigma$ is the identity, classical linear realization theory implies that,
generically, the triple (A, B, C) is determined only up to an invertible change of variables in the
state space. That is, except for degenerate situations that arise due to parameter dependencies (non-[...]),
if two triples (A, B, C) and $(\tilde{A}, \tilde{B}, \tilde{C})$ give rise to the same i/o behavior then there is an invertible
matrix T such that $T^{-1}AT = \tilde{A}$, $T^{-1}B = \tilde{B}$, and $CT = \tilde{C}$. This is
the same as saying that the two systems are equivalent under a linear change of variables $x(t) = T\tilde{x}(t)$.
Conversely, still in the classical case $\sigma$ = identity, any such T gives rise to another system with the same
i/o behavior, when starting with any given triple (A, B, C).


[...] In [7] we showed that for nonlinear
activations -under a mild nonlinearity axiom- the natural group of symmetries is far smaller than that of
arbitrary nonsingular matrices. It is instead just a finite group, the “sign-permutation subgroup” given
by the action of matrices T as above but having the special form of a permutation matrix composed with
a diagonal matrix performing at most a sign reversal at each neuron. That is, the input/output behavior
uniquely determines all the weights, except for a reordering of the variables and, for odd activation
functions, sign reversals of all incoming and outgoing weights at some units. As for linear systems,
suitable genericity conditions, given by certain simple algebraic inequalities, are assumed. (The paper
also shows that it is possible to determine $\sigma$ itself from the i/o data as well.) This result has the surprising
implication that a dimensionality reduction of the parameter space -as is the case with linear systems,
where canonical forms are central to identification methods- is not possible for neural nets. Moreover,
the proof, based on certain computations with vector fields, is constructive, so it suggests a technique
in  principle  for  obtaining  parameters  from  i/o  data,  totally  different  from  numerical  methods  based  on 
mismatch  minimization;  more  on  that  subject  later. 
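
The easy direction of this symmetry statement can be checked numerically. The sketch below assumes the discrete-time version of (4) with the odd activation tanh, a single input and a single output, and initial state x(0) = 0; the particular matrices and all function names are hypothetical. It compares the i/o behavior of a net with that of its images under a sign-permutation matrix and under a generic invertible matrix.

```python
import math

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def run_net(A, B, C, inputs):
    """x(t+1) = tanh(A x(t) + B u(t)) coordinatewise, y(t) = C x(t), x(0) = 0.
    Returns y(1), ..., y(T) for a scalar input sequence."""
    n = len(A)
    x, ys = [0.0] * n, []
    for u in inputs:
        x = [math.tanh(sum(A[i][j] * x[j] for j in range(n)) + B[i][0] * u) for i in range(n)]
        ys.append(sum(C[0][j] * x[j] for j in range(n)))
    return ys

def transform(A, B, C, T, T_inv):
    """Change of variables: (A, B, C) -> (T^-1 A T, T^-1 B, C T)."""
    return matmul(matmul(T_inv, A), T), matmul(T_inv, B), matmul(C, T)

A = [[0.5, -1.0], [0.3, 0.8]]
B = [[1.0], [-0.5]]
C = [[1.0, 2.0]]
inputs = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0]

# Sign-permutation: swap the two units and reverse the sign of one of them.
T_sp, T_sp_inv = [[0.0, 1.0], [-1.0, 0.0]], [[0.0, -1.0], [1.0, 0.0]]
# A generic invertible change of variables (a shear), outside the sign-permutation group.
T_gen, T_gen_inv = [[1.0, 1.0], [0.0, 1.0]], [[1.0, -1.0], [0.0, 1.0]]

y_ref = run_net(A, B, C, inputs)
y_sp = run_net(*transform(A, B, C, T_sp, T_sp_inv), inputs)
y_gen = run_net(*transform(A, B, C, T_gen, T_gen_inv), inputs)
```

The sign-permutation image reproduces the outputs because tanh is odd and acts coordinatewise, so it commutes with such a T; the shear does not commute with the nonlinearity and yields a different i/o map. The sketch says nothing about the harder converse, namely that no other symmetries exist.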

A  recent  variation  using  exponentials  as  activation  functions  was  pursued  as  part  of  Renee  Koplon’s 
just-completed  Ph.D.  dissertation,  in  work  supported  under  this  project  (cf.  [43],  [42]);  the  thesis 
included  a  discussion  of  a  simple  MATLAB  implementation  of  a  realization  algorithm. 

Other “system theoretic” questions can be studied for recurrent nets as well. For instance, the recent
work in [8] looked at questions of observability, that is, state distinguishability for a known system, as
opposed to determination of the system's parameters with a known initial state. This issue is central
to the construction of estimators as well as in further understanding minimality. The main result was
that observability can be characterized, if one assumes certain conditions on the nonlinearity and on the
system, in a manner very analogous to that of the linear case. Recall that for the latter, observability
is equivalent to the requirement that there not be any nontrivial A-invariant subspace included in the
kernel of C. Surprisingly, the result generalizes in a natural manner, except that one now needs to
restrict attention to certain special “coordinate” spaces. More recent work concerns the simultaneous
identification of parameters and initial states. This and other questions, such as the characterization of
controllability properties, are the subject of ongoing and planned research by the PIs.
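
For reference, the linear-case condition recalled above is equivalent to the classical rank test on the observability matrix. The sketch below implements only that linear test, using exact rational arithmetic; it is not the criterion of [8] for recurrent nets, and the function names are hypothetical.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix via Gaussian elimination over exact rationals."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def is_observable(A, C):
    """Linear case: rank [C; CA; ...; CA^(n-1)] = n, which holds exactly when no
    nonzero A-invariant subspace is contained in the kernel of C."""
    n = len(A)
    block, rows = [list(r) for r in C], []
    for _ in range(n):
        rows.extend(block)
        block = [[sum(row[k] * A[k][j] for k in range(n)) for j in range(n)] for row in block]
    return rank(rows) == n
```

For A = [[0, 1], [0, 0]], the output matrix C = [[1, 0]] gives an observable pair, while C = [[0, 1]] does not: the first coordinate axis is A-invariant and lies in the kernel of C.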


language  of  theoretical  computer  science.  The  topic  is  the  exploration  of  the  ultimate  capabilities  of 
recurrent nets viewed as analog computing devices. This area is a fascinating one, but very difficult
to  approach.  Part  of  the  problem  is  that,  much  interesting  work  notwithstanding,  analog  computation 
is  hard  to  model,  as  difficult  questions  about  precision  of  data  and  readout  of  results  are  immediately 
encountered  -see  the  many  references  in  our  papers  cited  below. 

In a recent series of papers as well as ongoing research by one of the PIs and a recently-finished
graduate student in the project -various aspects are covered in [5], [9], [10], [28], [30], and [36]- we took
the point of view that artificial neural nets provide an opportunity to reexamine some of the foundations
of analog computation from the new perspective afforded by an extremely simple yet surprisingly rich
model, in a context where techniques from dynamical systems theory interact naturally with more
standard notions from [...]
results on deterministic versus nondeterministic computation, and related the study to standard concepts
in complexity theory.

One of the most unexpected conclusions was that, at least within the formalism of analog computation
described there, recurrent neural nets are a universal model, in much the same manner as Turing machines
are for classical digital computing. [...]
precisely defined- could ever have more power -again in a precisely defined sense- up to polynomial
time  speedups.  Particularly  satisfying  theoretically  is  the  fact  that  the  most  natural  categories  for  the 
values  of  weights  correlate  perfectly  with  different  natural  subclasses  of  computing  devices,  and  can  be 
summarized  in  an  extremely  elegant  fashion  (see  table  below). 

Another  unanticipated  -and  intriguing-  conclusion  was  that  the  class  NP  of  nondeterministic 
polynomial-time  digital  computation  is  not  included  in  what  can  be  computed  in  polynomial  time 
with analog devices (this is proved under standard assumptions of the “P ≠ NP” type). Thus the solution
of combinatorial problems using analog devices may be subject to the same ultimate computational
obstructions, for large problem sizes, as with digital computing. There are also many direct consequences
of the results that are immediate but nonetheless interesting in their own right. For example, the problem
of determining if a dynamical system $x(t+1) = \sigma_n(Ax(t))$ ever reaches an equilibrium point, from
a given initial state, is shown to be effectively undecidable (at least for $\sigma = \pi$). (In Hopfield-type nets,
when dealing with content-addressable retrieval, the initial state is taken as the “input pattern” and the
final  state,  if  there  is  convergence  in  finite  time,  as  a  class  representative.) 

Of  course,  there  had  been  much  work  before  ours  concerning  the  computational  power  of  “neural 
networks”  of  various  types,  starting  at  least  with  the  classical  McCulloch-Pitts  work  in  the  early  1940s, 
and  continuing  since.  The  new  aspect  of  our  work,  which  we  believe  is  more  appealing  and  which  allows 
obtaining  precise  and  strong  mathematical  results,  is  that  we  do  not  assume  a  separate  potentially  infinite 
“storage”  device  (such  as  tapes  in  Turing  machines).  This  memory  is  instead  encoded  in  real-valued 
signals  inside  the  system.  Related  to  ours,  and  to  a  great  extent  our  initial  motivation,  was  the  work  by 
Jordan  Pollack  on  “neuring  machines,”  but  technically  our  results  are  completely  different,  both  in  their 
emphasis  on  efficient  computation  and  in  the  general  results.  Perhaps  the  work  closest  in  spirit  is  that 
on  real-number-based  computation  started  by  Blum,  Shub,  and  Smale.  In  contrast  to  that  line  of  work, 
however,  we  do  not  assume  that  discontinuous,  infinite  precision  “if-then-else”  decisions  are  possible  in 
our  model,  nor  can  discrete  results  be  read  out  of  the  system  through  infinite  precision  measurements. 
Thus  our  model  is  more  restricted  than  the  one  used  by  Blum  et  al.  On  the  other  hand,  one  might 
reasonably want to restrict the models even more, for instance to account for noise, and this is a topic
for further research.

To be more explicit, we studied in our work [...] systems are of interest, it is easy to see [...]
finite automata.) Our main results -after suitable precise definitions- are summarized as follows,
stated for simplicity in terms of formal language recognition:


Weights     Capability    Polytime
integer     regular       regular
rational    recursive     (usual) P
real        arbitrary     analog P

More specifically: (1) with integer matrices A, B, and C, one obtains just the power of finite automata;
(2) with rational weights, recurrent networks are computationally equivalent, up to a constant time
speedup, to Turing machines; and (3) with real parameters, all possible languages, recursive or not, are
“computable,” but when imposing polynomial-time constraints, the class that results is analog-P, for
analog polynomial time.
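
Case (1) can be made concrete with a two-unit net with integer weights that behaves as the two-state parity automaton. The sketch assumes the discrete-time form (4), a saturated-linear activation clipping to [0, 1] (one possible normalization of the piecewise linear activation, chosen here for convenience), and a second input channel held at 1 to play the role of a bias; none of these choices is taken from the papers cited above.

```python
def sat(v):
    """Saturated-linear activation: 0 below 0, identity on [0, 1], 1 above 1 (assumed form)."""
    return max(0, min(1, v))

# Integer weights. With s = x2 - x1 the current parity and b the input bit:
#   x1' = sat(s + b - 1)  (logical AND of s and b)
#   x2' = sat(s + b)      (logical OR of s and b)
# so the new parity is x2' - x1' = s XOR b.
A = [[-1, 1], [-1, 1]]
B = [[1, -1], [1, 0]]   # second input channel is the constant 1 (bias)
C = [[-1, 1]]

def parity_net(bits):
    """Run x(t+1) = sat(A x(t) + B u(t)) from x(0) = 0 and return y = C x at the end."""
    x = [0, 0]
    for b in bits:
        u = [b, 1]
        x = [sat(sum(A[i][j] * x[j] for j in range(2)) + sum(B[i][k] * u[k] for k in range(2)))
             for i in range(2)]
    return C[0][0] * x[0] + C[0][1] * x[1]
```

Because all weights and inputs are integers and the activation clips to [0, 1], the state only ever takes finitely many values, which is the reason integer-weight nets cannot exceed the power of finite automata.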


In [32] it is shown how to [...] by networks [...] weights in certain
Kolmogorov-complexity definable classes. It is impossible in this brief overview to explain the details
of the definitions and constructions in the above series of papers, but it should be emphasized
that a central role is played by a Cantor-like representation of internal values, which by virtue of its
self-similarity  under  scaling  allows  easy  updates,  and  which  because  of  its  gaps  allows  finite-precision 
decisions  and  measurements.  Another  feature  is  the  equivalence  between  “analog-P”  and  the  class  P/poly 
studied  by  Karp  and  Lipton,  of  nonuniform  polynomial-size  circuits  or  equivalently  sparse-oracle  Turing 
Machines. 
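
A Cantor-like representation of this kind can be sketched as follows. The specific base-4 encoding with digits 1 and 3, and the saturation onto [0, 1], are assumptions made for illustration (the report does not give the details); the function names are hypothetical. A binary string is stored as a single rational number, and pushing, reading, and popping a bit each use only affine maps and one saturation.

```python
from fractions import Fraction

def sat(v):
    """Saturated-linear map onto [0, 1]."""
    return max(Fraction(0), min(Fraction(1), v))

def push(q, bit):
    """Prepend `bit` to the encoded string: q -> q/4 + (2*bit + 1)/4."""
    return q / 4 + Fraction(2 * bit + 1, 4)

def top(q):
    """Leading bit of a nonempty string. Encodings starting with 0 lie in [1/4, 1/2) and
    those starting with 1 lie in [3/4, 1), so sat(4q - 2) is exactly 0 or 1."""
    return int(sat(4 * q - 2))

def pop(q):
    """Remove the leading bit: q -> 4q - (2*top + 1)."""
    return 4 * q - (2 * top(q) + 1)

def encode(bits):
    """Encode a list of bits, first bit most significant; the empty string is 0."""
    q = Fraction(0)
    for b in reversed(bits):
        q = push(q, b)
    return q

def decode(q):
    """Recover the bit list by repeated top/pop."""
    bits = []
    while q != 0:
        bits.append(top(q))
        q = pop(q)
    return bits
```

The two features mentioned above are visible here: rescaling by 4 maps the set of encodings to itself, which is what makes push and pop affine, and the gap between the intervals [1/4, 1/2) and [3/4, 1) lets the leading bit be read without an infinite-precision comparison.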
3  Some  More  Details

This  part  of  the  report  provides  a  few  more  details  on  some  of  the  topics  discussed  earlier.  Due  to  lack 
of  space  much  must  be  omitted,  so  references  to  the  literature  are  given  for  most  fine  technical  points. 

Feedforward  Nets 

We first introduce some notation and terminology. We let $\mathcal{F}_{n,\sigma,m}$ be the set of functions $f : R^m \to R$
computable by a 1HL net with n hidden units and activation $\sigma$, that is, those that can be expressed
as in Equation (1), and write $\mathcal{F}_{\sigma,m} := \bigcup_{n>0} \mathcal{F}_{n,\sigma,m}$. The set $\mathcal{F}^p_{n,\sigma,m} := (\mathcal{F}_{n,\sigma,m})^p$ of functions whose
coordinates are in $\mathcal{F}_{n,\sigma,m}$ can be thought of as the set of all $f : R^m \to R^p$ computable by 1HL nets with
multiple outputs.

Let $\mathcal{U}$ be a set, to be called the input set, and let $\mathcal{Y}$ be another set, the output set. In the discussion
below, for 1HL nets, $\mathcal{U} = R^m$. To measure discrepancy in outputs, it may be assumed that $\mathcal{Y}$ is a
metric space. For simplicity, assume from now on that $\mathcal{Y} = R$, or $\mathcal{Y}$ is the subset {0, 1}, if binary data
is of interest. A labeled sample is a finite set $S = \{(u_1, y_1), \ldots, (u_s, y_s)\}$, where $u_1, \ldots, u_s \in \mathcal{U}$ and
$y_1, \ldots, y_s \in \mathcal{Y}$. (The $y_i$'s are the “labels;” they are binary if $y_i \in \{0, 1\}$.) It is assumed that the sample
is consistent, that is, $u_i = u_j$ implies $y_i = y_j$. A classifier is a function $F : \mathcal{U} \to \mathcal{Y}$. The error of F on
the labeled sample S is defined as

$$E(F, S) := \sum_{i=1}^{s} (F(u_i) - y_i)^2 .$$

A set $\mathcal{F}$ of classifiers will be called an architecture. Typically, and below, [...]
that of nets with m = 1 inputs, p = 1 outputs, and n hidden units, that is, $\mathcal{F} := \mathcal{F}_{n,\sigma,1}$; here the parameter
set has dimension r = 3n + 1. The sample S is loadable into $\mathcal{F}$ iff

$$\inf_{F \in \mathcal{F}} E(F, S) = 0 .$$


Note that for a binary sample S and a binary classifier F, E(F, S) just counts the number of misclassifications,
so in the binary case loadability corresponds to being able to obtain exactly the values $y_i$ by
suitable choice of $F \in \mathcal{F}$. For continuous-valued $y_i$'s, loadability means that values arbitrarily close to
them can be obtained.
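
These definitions translate directly into a short sketch. Equation (1) is not reproduced in this part of the report, so the form of the net used below, u -> w0 + sum_i c_i * sigma(a_i * u + b_i) for a scalar input, is an assumption chosen to match the parameter count r = 3n + 1; the convention H(0) = 0 for the Heaviside function and all function names are likewise hypothetical.

```python
def heaviside(v):
    """Heaviside activation (the value at 0 is an assumed convention)."""
    return 1.0 if v > 0 else 0.0

def one_hl_net(params, sigma):
    """Scalar-input 1HL net u -> w0 + sum_i c_i * sigma(a_i * u + b_i),
    with params = (w0, [(c_1, a_1, b_1), ..., (c_n, a_n, b_n)])."""
    w0, units = params
    return lambda u: w0 + sum(c * sigma(a * u + b) for c, a, b in units)

def num_parameters(params):
    """Dimension of the parameter set: r = 3n + 1 for n hidden units."""
    return 1 + 3 * len(params[1])

def error(F, sample):
    """E(F, S) = sum_i (F(u_i) - y_i)^2 for a labeled sample S = [(u_i, y_i), ...]."""
    return sum((F(u) - y) ** 2 for u, y in sample)

def is_consistent(sample):
    """A sample is consistent if equal inputs always carry equal labels."""
    seen = {}
    return all(seen.setdefault(u, y) == y for u, y in sample)

# A binary sample whose labels form a "bump": 0, 1, 0.
sample = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
two_units = (0.0, [(1.0, 1.0, -0.5), (-1.0, 1.0, -1.5)])   # H(u - 0.5) - H(u - 1.5)
one_unit = (0.0, [(1.0, 1.0, -0.5)])                        # H(u - 0.5)
```

Here the two-unit net attains error 0, so the sample is loaded exactly, while the particular one-unit net shown misclassifies one point; in the binary case the error is simply the number of misclassified points.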

One may define the capacity $c(\mathcal{F})$ of $\mathcal{F}$ via the requirement that:

$c(\mathcal{F}) \ge k$ iff every S of cardinality k is loadable.

(That is, $c(\mathcal{F}) = \infty$ means that all finite S are loadable, and $c(\mathcal{F}) = k < \infty$ means that each S of
cardinality $\le k$ is loadable but some S of cardinality k + 1 is not. Other definitions
of capacity measures are possible, in particular the VC dimension mentioned below as well as one in
terms of “generic” loadability; see [3].)


Various relations between capacity and number of neurons are known for nets with one hidden layer
and Heaviside or sigmoidal activations. It is an easy exercise to show that the results are independent
of the input dimension $m$, for any fixed activation type $\sigma$ and fixed number of units $n$.
In the case $m = p = 1$, parameter counts are interesting, so that case will be considered next.
Observe that, for 1HL nets with one input and $n$ hidden units (and $p=1$), there are $3n+1$ parameters
(appearing nonlinearly), though for $\mathcal{H}$, effectively only $2n+1$ matter. (In the case of the standard
sigmoid, a Jacobian computation shows that these parameters are independent.) For classification
purposes, it is routine to consider just the sign of the output of a neural net, and to classify an input
according to this sign. Thus one introduces the class $\mathcal{H}(\mathcal{F}_{n,\sigma})$ consisting of all $\{0,1\}$-valued functions
of the form $\mathcal{H}(f(u))$ with $f \in \mathcal{F}_{n,\sigma}$. Of interest are scaling properties as $n \to \infty$. Let

$$\mathrm{CLSF}(\sigma) := \liminf_{n\to\infty} c(\mathcal{F})/r(\mathcal{F})$$

for $\mathcal{F} = \mathcal{H}(\mathcal{F}_{n,\sigma})$, and

$$\mathrm{INTP}(\sigma) := \liminf_{n\to\infty} c(\mathcal{F})/r(\mathcal{F})$$

for $\mathcal{F} = \mathcal{F}_{n,\sigma}$. These quantities measure (asymptotically) the ratio between the capacity (in the sense
just defined) and the number of parameters. For classification, it is shown in [3] that $\mathrm{CLSF}(\mathcal{H}) = 1/3$
and that $\mathrm{CLSF}(\sigma) \ge 2/3$ for any $\sigma$ which has a nonzero derivative at some point and is sigmoidal,
meaning technically now that both $\lim_{u\to-\infty} \sigma(u)$ and $\lim_{u\to+\infty} \sigma(u)$ exist and are distinct. (It is also
shown there that if one allows "direct i/o connections," that is to say, a linear term added to (1), then
$\mathrm{CLSF}(\mathcal{H})$ doubles to 2/3, which is somewhat surprising.) The sigmoidal bound is best possible, in the
sense that for the piecewise linear $\pi$ one has $\mathrm{CLSF}(\pi) = 2/3$, while it is the case that $\mathrm{CLSF}(\sigma) = \infty$ for
even some real-analytic sigmoidals $\sigma$. Regarding continuous-valued interpolation, it is shown in the
same paper that $\mathrm{INTP}(\mathcal{H}) = 1/3$, and $\mathrm{INTP}(\sigma) \ge 2/3$ if $\sigma$ is as above and also continuous. It is also
shown that $\mathrm{INTP}(\pi) = 2/3$ and that $2/3 \le \mathrm{INTP}(\sigma_s) \le 1$ for the standard sigmoid (the proof of the
upper bound in this latter case [illegible in source]; the value may be infinite for
more arbitrary, even analytic, sigmoids). These results show the additional power of sigmoidal over
Heaviside activations.


Of  course,  one  could  use  many  different  approaches  to  loadability  than  the  idea  of  numerically 

as  an  entropy  measure,  as  also  suggested  in  the  neural  nets  literature).  Blum  and  Rivest  showed  that  any 
possible neural net learning algorithm based on fixed architectures faces severe computational barriers,
when the loading problem (do there exist weights that satisfy the desired objective?) is posed as a
decision question in computational complexity.
This issue appears also in the algorithmic part of PAC learning theory. For fixed input dimension and
Heaviside activations, loadability becomes essentially a simple linear programming question, but for
such activations the problem is NP-hard when scaling with respect to the number of input or output
dimensions, as shown in their work. In view of neural network practice, a more relevant result would
concern the complexity of the loading problem when using sigmoidal activations. From an example in [3], it is
known that the problem need not be NP-complete, for certain $\sigma$. But for the standard sigmoid $\sigma_s$ the
matter is still open. If it were the case that the problem can be solved in polynomial time, this may
indicate that perhaps there is no need for complex searches in feedforward learning problems. It is
unlikely that this problem will be efficiently decidable in general, but with bounded input dimension this
may very well be the case. As an intermediate step, one of the PIs and coworkers were recently able to
extend the Blum-Rivest result to the piecewise linear activation $\pi$, showing the NP-completeness of the
loading problem for nets with two hidden units and activation $\pi$.

We now turn to some more details about the uniqueness question, which was mentioned in the above
discussion and also in some detail in the Overview part of the report (the remarks made there will not
be repeated).

More on Uniqueness


Consider 1HL-nets, indicated by the data $\Sigma := (B, C, b, c_0, \sigma)$ (omitting $\sigma$ if obvious from the context)
and corresponding 1HL-computable functions $f(u) = \mathrm{beh}_\Sigma(u) = C\vec{\sigma}_n(Bu + b) + c_0$ as in Equation (1).
We assume for simplicity that $\sigma$ is an odd function. Two networks $\Sigma$ and $\hat\Sigma$ are (input/output) equivalent,
denoted $\Sigma \sim \hat\Sigma$, if $\mathrm{beh}_\Sigma = \mathrm{beh}_{\hat\Sigma}$ (equality of functions). The question to be studied, then, is: to what
extent does $\Sigma \sim \hat\Sigma$ imply $\Sigma = \hat\Sigma$? We'll say that the function $\sigma$ satisfies the independence property
("IP" from now on) if, for every positive integer $l$, nonzero real numbers $a_1,\ldots,a_l$, and real numbers
$b_1,\ldots,b_l$ for which the pairs $(a_i,b_i)$, $i = 1,\ldots,l$, satisfy $(a_i,b_i) \neq \pm(a_j,b_j)$ for all $i \neq j$, it must hold that
the functions

$$1,\ \sigma(a_1 u + b_1),\ \ldots,\ \sigma(a_l u + b_l)$$

are linearly independent. The function $\sigma$ satisfies the weak independence property ("WIP") if the above
linear independence property is only required to hold for all pairs with $b_i = 0$, $i = 1,\ldots,l$.

Let $\Sigma(B, C, b, c_0, \sigma)$ be given, and denote by $B_i$ the $i$th row of the matrix $B$ and by $c_i$ and $b_i$ the
$i$th entries of $C$ and $b$ respectively. With these notations, $\mathrm{beh}_\Sigma(u) = c_0 + \sum_{i=1}^{n} c_i\,\sigma(B_i u + b_i)$. Call
$\Sigma$ irreducible if the following properties hold: (1) $c_i \neq 0$ for each $i = 1,\ldots,n$; (2) $B_i \neq 0$ for each
$i = 1,\ldots,n$; and (3) $(B_i,b_i) \neq \pm(B_j,b_j)$ for all $i \neq j$. Given $\Sigma(B, C, b, c_0, \sigma)$, a sign-flip operation
consists of simultaneously reversing the signs of $c_i$, $B_i$, and $b_i$, for some $i$. A node-permutation consists
of interchanging $(c_i, B_i, b_i)$ with $(c_j, B_j, b_j)$, for some $i, j$. Two nets $\Sigma$ and $\hat\Sigma$ are sign-permutation (sp)
equivalent if $\hat n = n$ and $(B, C, b, c_0)$ can be transformed into $(\hat B, \hat C, \hat b, \hat c_0)$ by means of a finite number
of sign-flips and node-permutations. Of course, sp-equivalent nets have the same behavior (since $\sigma$ has
been assumed to be odd). With this terminology, the following holds: Let $\sigma$ be odd and satisfy property
IP. Assume that $\Sigma$ and $\hat\Sigma$ are both irreducible, and $\Sigma \sim \hat\Sigma$. Then, $\Sigma$ and $\hat\Sigma$ are sp-equivalent. A net $\Sigma$ has
no offsets if $b = 0$ and $c_0 = 0$ (the terminology "biases" or "thresholds" is sometimes used instead of offsets).
If $\sigma$ is odd and satisfies property
WIP, the same statement is true for nets with no offsets.
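
The invariance of the behavior under sign-flips and node-permutations can be checked numerically. The sketch below does this for $\sigma = \tanh$ (which is odd); the weights, the helper names, and the test inputs are hypothetical and chosen only for illustration.

```python
import math

def beh(B, C, b, c0, u, sigma=math.tanh):
    """beh(u) = c0 + sum_i c_i * sigma(B_i . u + b_i) for a 1HL net."""
    total = c0
    for Bi, ci, bi in zip(B, C, b):
        total += ci * sigma(sum(w * x for w, x in zip(Bi, u)) + bi)
    return total

def sign_flip(B, C, b, i):
    """Reverse the signs of c_i, B_i and b_i simultaneously."""
    B2, C2, b2 = [row[:] for row in B], C[:], b[:]
    B2[i] = [-w for w in B2[i]]
    C2[i], b2[i] = -C2[i], -b2[i]
    return B2, C2, b2

def node_permutation(B, C, b, i, j):
    """Interchange (c_i, B_i, b_i) with (c_j, B_j, b_j)."""
    B2, C2, b2 = [row[:] for row in B], C[:], b[:]
    B2[i], B2[j] = B2[j], B2[i]
    C2[i], C2[j] = C2[j], C2[i]
    b2[i], b2[j] = b2[j], b2[i]
    return B2, C2, b2

# Hypothetical net with n = 3 hidden units and m = 2 inputs.
B = [[1.0, -2.0], [0.5, 0.3], [-1.5, 0.7]]
C = [2.0, -1.0, 0.5]
b = [0.1, -0.4, 0.9]
c0 = 0.3

# An sp-equivalent net: flip node 1, then swap nodes 0 and 2.
Bh, Ch, bh = node_permutation(*sign_flip(B, C, b, 1), 0, 2)

test_inputs = [(0.0, 0.0), (1.0, -1.0), (-0.3, 2.5), (4.0, 0.2)]
max_gap = max(abs(beh(B, C, b, c0, u) - beh(Bh, Ch, bh, c0, u))
              for u in test_inputs)
```

The quantity `max_gap` stays at rounding-error level, as expected from the oddness of $\tanh$. The uniqueness result goes in the opposite direction: for irreducible nets and $\sigma$ satisfying IP, these operations are the only way two nets can share a behavior.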


If $\sigma$
is odd and infinitely differentiable, and if there are an infinite number of nonzero derivatives $\sigma^{(k)}(0)$,
then $\sigma$ satisfies property WIP. Nets with no offsets appear naturally in signal processing and control
applications, as there it is often the case that one requires that $\mathrm{beh}(0) = 0$, that is, the zero input signal
causes no effect, corresponding to equilibrium initial states for a controller or filter. Results in this case
are closely related to work in the 1970s by Rugh and coworkers and by Boyd and Chua in the early
1980s in control theory. For analytic $\sigma$ with all derivatives at zero nonvanishing, functions computable
by 1HL nets with no offsets are dense on compacts among continuous functions; if $\sigma$ is odd and all odd
derivatives at zero do not vanish, one can so approximate all odd functions.


The more relevant property IP is far more restrictive. For obvious examples of functions not satisfying
IP, take $\sigma(x) = e^x$, any periodic function, or any polynomial. However, the most interesting activation
for neural network applications is $\sigma(x) = \tanh(x)$, or equivalently after a linear transformation, the
standard sigmoid $\sigma_s$. In this case, one of the PIs showed in [4] that the IP property, and hence the desired
uniqueness statement, does hold. The proof was based on explicit computations for the particular
function $\tanh(x)$. An alternative proof is possible, using analytic continuations, and allows a more
general result to be established. The idea is that residues can be used to determine the weights.
The following is an easy to check sufficient condition for IP to hold (see [17]): $\sigma$ is a real-analytic
function, and it extends to an analytic function $\sigma : D \to \mathbb{C}$ defined on a subset $D \subseteq \mathbb{C}$ of the form
$D = \{|\mathrm{Im}\, z| \le \lambda\} \setminus \{z_0, \bar z_0\}$ for some $\lambda > 0$, where $\mathrm{Im}\, z_0 = \lambda$ and $z_0$ and $\bar z_0$ are singularities, that is,
there is a sequence $z_n \to z_0$ so that $|\sigma(z_n)| \to \infty$, and similarly for $\bar z_0$. It is easy to see that tanh, as
well as most other functions that have appeared in the neural nets experimental literature such as arctan,
satisfies the above sufficient property. (Fefferman has recently extended these ideas to several hidden
layers, based on the observation that weights in higher layers can be found as iterated accumulation
points of residues; his highly nontrivial results apply only to the particular case of tanh, however.)
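
To see concretely why the exponential fails IP, note that $e^{au+b_1} = e^{b_1-b_2}\,e^{au+b_2}$, so two units with the same $a$ and different offsets are proportional. The following sketch checks this numerically and contrasts it with $\tanh$. The helper name and the parameter values are assumptions for illustration, and the check only tests proportionality of one pair of functions, which is a special case of the linear dependence excluded by IP.

```python
import math

def ratio_is_constant(sigma, a, b1, b2, us, tol=1e-9):
    """True if sigma(a*u + b1) / sigma(a*u + b2) is the same for all u in us."""
    ratios = [sigma(a * u + b1) / sigma(a * u + b2) for u in us]
    return max(ratios) - min(ratios) < tol

# Hypothetical parameters; (a, b1) != +-(a, b2), as required in the definition of IP.
a, b1, b2 = 1.5, 0.2, -0.7
us = [-2.0, -1.0, 0.0, 1.0, 2.0]   # chosen so that a*u + b2 is never zero

exp_dependent = ratio_is_constant(math.exp, a, b1, b2, us)
tanh_dependent = ratio_is_constant(math.tanh, a, b1, b2, us)
```

Here `exp_dependent` comes out true and `tanh_dependent` false: the two exponential units are proportional, the two tanh units are not.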


Two  Hidden  Layers 

As discussed in the Overview part, two-hidden-layer nets are necessary for inverse problems and in
particular for many control applications. Next we spell out some more details. For the topics treated
here, there is no need to define "2HL nets" themselves, but just their behavior. If $\sigma : \mathbb{R} \to \mathbb{R}$ is any
function, and $m, p$ are positive integers, a function computed by a two-hidden layer ("2HL") net with
$m$ inputs and $p$ outputs and activation function $\sigma$ is by definition one of the type $f \circ \vec{\sigma}_n \circ g \circ \vec{\sigma}_l \circ h$, where
$f$, $g$, and $h$ are affine maps ($n + l$ is then called the number of "hidden units"). A function computable
by a 2HL net with direct input/output connections is one of the form $Fu + f(u)$, where $F$ is linear and
$f$ is computable by a 2HL net. Before discussing control problems, one can understand the necessity
of 2HL nets by means of a more abstract type of question. Consider the following inversion problem:
Given a continuous function $f : \mathbb{R}^m \to \mathbb{R}^p$, a compact subset $C \subseteq \mathbb{R}^p$ included in the image of $f$,
and an $\varepsilon > 0$, find a function $\varphi : \mathbb{R}^p \to \mathbb{R}^m$ so that $\|f(\varphi(x)) - x\| < \varepsilon$ for all $x \in C$. One wants to
find a $\varphi$ which is computable by a net, as done in global solutions of inverse kinematics problems, in
which case the function $f$ is the direct kinematics. Unless we are restricting to a compact domain and
$f$ is one-to-one, or if $f$ is very special (e.g. linear), in general discontinuous functions $\varphi$ are needed, so
nets with continuous $\sigma$ cannot be used. However, and this is the interesting part, it has been established that nets
with just one hidden layer, even if discontinuous $\sigma$ is allowed, are not enough to guarantee the solution
of all such problems. On the other hand, the same work shows that nets with two hidden layers (using $\mathcal{H}$
as the activation type) are sufficient, for every possible $f$, $C$, and $\varepsilon$. The basic obstruction is due, in
essence, to the impossibility of approximating by 1HL nets the characteristic function of
any bounded polytope, while for some (non one-to-one) $f$ the only possible one-sided inverses $\varphi$ must
be close to such a characteristic function.

Consider now state-feedback controllers. The objective, given a system $\dot x = f(x,u)$ with $f(0,0)=0$,
is to find a stabilizer $u=k(x)$, $k(0)=0$, making $x=0$ a globally asymptotically stable state of the closed-
loop system $\dot x = f(x, k(x))$. The first remark is that the existence of a smooth stabilizer $k$ is essentially
equivalent to the possibility of stabilizing using 1HL nets (with smooth $\sigma$). (Thus the simple classes
of systems studied in many neurocontrol papers, which are typically feedback-linearizable and hence
continuously stabilizable, can be controlled using such 1HL nets.) More precisely, assume that $f$ is
twice continuously differentiable, that $k$ is also in $C^2$, that the origin is an exponentially stable point
for $\dot x = f(x, k(x))$, and that $K$ is a compact subset of the domain of stability. Pick any $\sigma$ which has the
property that twice continuously differentiable functions can be approximated uniformly, together with
their derivatives, using 1HL nets (most interesting twice-differentiable scalar nonlinearities will do, as
shown in the work of Hornik and others). Then, one can conclude that there is also a different $k$, this
one a 1HL net with activation $\sigma$, for which exactly the same stabilization property holds. One only
needs to show that if $k_n \to k$ in $C^2(K)$, with all $k_n(0)=0$ (this last property can always be achieved by
simply considering $k_n(x) - k_n(0)$ as an approximating sequence), then $\dot x = f(x, k_n(x))$ has the origin
as an exponentially stable point and $K$ is in the domain of attraction, for all large $n$. Finally, one knows
that there is a neighborhood $V$ of zero, independent of $n$, where exponential stability will hold, for all
$n$ sufficiently large, because $f(x, k_n(x)) = A_n x + g_n(x)$, with $A_n \to A$ and with $g_n(x)$ being $o(x)$
uniformly in $n$ (this last part uses the fact that the approximation is in $C^2(K)$). Now continuity of solutions
on the right-hand side gives the result globally on $K$.
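
A toy simulation can make the statement concrete. The sketch below closes the loop around a hypothetical unstable scalar plant $\dot x = x + u$ with a hand-picked one-unit tanh net $k(x) = -3\tanh(x)$, using forward Euler. The plant, the gains, the step size, and the function names are all assumptions for illustration; this is not the approximation argument of the text, only an instance of a 1HL feedback that is exponentially stabilizing at the origin with a compact set inside its domain of attraction.

```python
import math

def simulate(f, k, x0, dt=1e-3, T=10.0):
    """Forward-Euler integration of the closed loop x' = f(x, k(x))."""
    x = x0
    for _ in range(int(T / dt)):
        x += dt * f(x, k(x))
    return x

# Hypothetical unstable scalar plant with f(0, 0) = 0.
f = lambda x, u: x + u

# 1HL net with one hidden tanh unit and k(0) = 0; the closed-loop
# linearization at the origin is x' = (1 - 3) x = -2 x.
k_net = lambda x: -3.0 * math.tanh(x)

inside = [simulate(f, k_net, x0) for x0 in (-2.0, -0.5, 1.0, 2.0)]
outside = simulate(f, k_net, 4.0)
```

Trajectories starting in $K = [-2, 2]$ converge to the origin, while the one starting at $x_0 = 4$ escapes, since the bounded feedback cannot overcome the plant far from zero. This is consistent with stabilization on compact subsets of the domain of attraction rather than globally.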

In general, smooth (or even continuous) stabilizers fail to exist. Thus 1HL-network feedback laws,
with continuous $\sigma$, do not provide a rich enough class of controllers. This motivates the search for
discontinuous feedback. It is easy to provide examples where 1HL $\mathcal{H}$-nets will stabilize but no net with
continuous activations (hence implementing a continuous feedback) will. More surprisingly, 1HL-net
feedback laws, even with $\mathcal{H}$ activations, are not in general enough (intuitively, one is again trying to
solve inverse problems), but two hidden layer nets using $\mathcal{H}$ (and having direct i/o connections) are
always sufficient. More precisely, [2] shows that the weakest possible type of open-loop asymptotic
controllability is sufficient to imply the existence of (sampled) controllers built using such two-hidden
layer nets, which stabilize on compact subsets of the state space. On the other hand, an example is given
there of a system satisfying the asymptotic controllability condition but for which every possible 1HL
stabilizer gives rise to a nontrivial periodic orbit.

Linear  Systems  with  Saturated  Actuators 

The problem of stabilization of linear systems subject to actuator saturation was introduced in the
Overview part. As mentioned there, the paper [11] (summary in [39]) provided a construction of
controllers that globally stabilize such systems. It was the first complete result for this important kind
of control problem, and the only conditions imposed are the obvious necessary ones, namely that
no eigenvalues of the uncontrolled system have positive real part and that the standard stabilizability
rank condition holds. Interestingly in the present context, it involves neural nets. As far as we know,
this result represents the first time that a neural net architecture has been shown to be perfectly suited
to a control problem.

The precise statements are as follows (a much more general result is given in the above references,
but for purposes of this report we merely state a simplified version). We first define $\mathcal{S}$ to be the class
of all functions $\sigma$ from $\mathbb{R}$ to $\mathbb{R}$ such that (1) $\sigma$ is locally Lipschitz, (2) $s\sigma(s) > 0$ whenever $s \neq 0$, (3)
[remainder of the definition illegible in source]. For a finite sequence
$\sigma = (\sigma_1,\ldots,\sigma_k)$ of functions in $\mathcal{S}$, we define $\mathcal{G}_n(\sigma)$ to be the class
of functions $h : \mathbb{R}^n \to \mathbb{R}$ given by

$$h(x) = a_1\sigma_1(f_1(x)) + a_2\sigma_2(f_2(x)) + \cdots + a_k\sigma_k(f_k(x)),$$

where $f_1,\ldots,f_k$ are linear functions and $a_1,\ldots,a_k$ are nonnegative constants such that $a_1 + \cdots + a_k \le 1$.
(In particular, we could of course pick 1HL nets with such small coefficients; more generally, one can
have different activations at each node.) Next, for an $m$-tuple $l = (l_1,\ldots,l_m)$ of nonnegative integers,
define $|l| = l_1 + \cdots + l_m$. For a finite sequence $\sigma = (\sigma_1,\ldots,\sigma_{|l|}) = (\sigma^1_1,\ldots,\sigma^1_{l_1},\ldots,\sigma^m_1,\ldots,\sigma^m_{l_m})$
of functions in $\mathcal{S}$, we let $\mathcal{G}^l_n(\sigma)$ denote the set of all functions $h : \mathbb{R}^n \to \mathbb{R}^m$ that are of the form
$(h_1,\ldots,h_m)$, where $h_i \in \mathcal{G}_n(\sigma^i_1,\ldots,\sigma^i_{l_i})$ for $i = 1, 2, \ldots, m$. (It is clear that $\mathcal{G}^l_n(\sigma) = \mathcal{G}_n(\sigma)$ if
$m=1$.)

Let $\delta > 0$. We say that a function $f : [0,+\infty) \to \mathbb{R}^n$ is eventually bounded by $\delta$ (and write
$|f| \le_{\mathrm{ev}} \delta$), if there exists $T > 0$ such that $|f(t)| \le \delta$ for $t \ge T$. Given an $n$-dimensional system
$\Sigma : \dot x = f(x)$, we say that $\Sigma$ is SISS (small-input small-state) if for every $\varepsilon > 0$ there is a $\delta > 0$
such that, if $e : [0,+\infty) \to \mathbb{R}^n$ is bounded, measurable, and eventually bounded by $\delta$, then every
solution $t \mapsto x(t)$ of $\dot x = f(x) + e(t)$ is eventually bounded by $\varepsilon$. For $\Delta > 0$, $N > 0$, we say that $\Sigma$ is
$\mathrm{SISS}_L(\Delta, N)$ if, whenever $0 < \delta \le \Delta$, it follows that, if $e : [0,+\infty) \to \mathbb{R}^n$ is bounded, measurable,
and eventually bounded by $\delta$, then every solution of $\dot x = f(x) + e(t)$ is eventually bounded by $N\delta$. The
system is $\mathrm{SISS}_L$ if it is $\mathrm{SISS}_L(\Delta, N)$ for some $\Delta > 0$, $N > 0$. For a system $\dot x = f(x, u)$, we say that
a feedback $u = k(x)$ is stabilizing if 0 is a globally asymptotically stable equilibrium of the closed-loop
system $\dot x = f(x, k(x))$. If, in addition, this closed-loop system is $\mathrm{SISS}_L$, then we will say that $k$ is
$\mathrm{SISS}_L$-stabilizing. The SISS property, besides being of interest in itself, is essential in allowing a
proof by induction. For a square matrix $A$, we let $\mu(A)$ denote the number of eigenvalues $z$ of $A$ such
that $\mathrm{Re}\, z = 0$ and $\mathrm{Im}\, z \ge 0$, counting multiplicities.

Our main result is as follows; it asserts the existence of a stabilizer of the above type, and also
provides an estimate on the number of "hidden units" needed: Let $\Sigma$ be a linear system $\dot x = Ax + Bu$
with state space $\mathbb{R}^n$ and input space $\mathbb{R}^m$. Assume that $\Sigma$ is asymptotically null-controllable, i.e. that
all the eigenvalues of $A$ have nonpositive real parts and all the eigenvalues of the uncontrollable part
of $A$ have strictly negative real parts. Let $\mu = \mu(A)$. Let $\sigma = (\sigma_1,\ldots,\sigma_\mu)$ be an arbitrary sequence
of bounded functions belonging to $\mathcal{S}$. Then there exists an $m$-tuple $l = (l_1,\ldots,l_m)$ of nonnegative
integers such that $|l| = \mu$, for which there is an $\mathrm{SISS}_L$-stabilizing feedback $u = -k_G(x)$ such that (a)
$k_G \in \mathcal{G}^l_n(\sigma)$ and (b) the linearization of the closed-loop system is asymptotically stable.
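
For intuition only, the sketch below simulates the simplest nontrivial instance of bounded-control stabilization: the double integrator $\dot x_1 = x_2$, $\dot x_2 = u$ under the feedback $u = -\mathrm{sat}(x_1 + x_2)$, which has the form $a_1\sigma_1(f_1(x))$ with $a_1 = 1$ and $f_1$ linear. This is not the construction of the cited papers. It relies on the known fact that a single saturated linear feedback suffices for the double integrator, something that fails for chains of three or more integrators. The initial state, step size, and horizon are arbitrary choices.

```python
def sat(s):
    """Standard saturation: bounded, locally Lipschitz, s * sat(s) > 0 for s != 0."""
    return max(-1.0, min(1.0, s))

def simulate_double_integrator(x1, x2, dt=1e-3, T=60.0):
    """Forward Euler for x1' = x2, x2' = u with u = -sat(x1 + x2)."""
    peak_u = 0.0
    for _ in range(int(T / dt)):
        u = -sat(x1 + x2)
        peak_u = max(peak_u, abs(u))
        x1, x2 = x1 + dt * x2, x2 + dt * u
    return x1, x2, peak_u

# Start well outside the region where the feedback is unsaturated.
x1_final, x2_final, peak_u = simulate_double_integrator(5.0, 0.0)
```

The control stays within the actuator limits ($|u| \le 1$) along the whole trajectory, and the state still converges to the origin from an initial condition at which the controller is saturated.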


The paper [34] developed an explicit design concerning a longitudinal flight model of an F-8 aircraft,
with saturations on the elevator rate, and tested the obtained controller on the original nonlinear model.
(We picked relatively small values of these saturations, to analyze our control design under demanding
conditions.) We chose the F-8 example since all parameters and typical trim conditions are publicly
available, and the model has often been used as a test case for aircraft control designs. The procedure
we followed consisted of first linearizing about an operating point and then constructing a globally
stabilizing controller for the resulting linearization, with respect to this given trim condition, following
the steps in our papers. Finally, we compared the performance of the controller (applied to
the original nonlinear airplane and starting reasonably far from the desired operating point) with the
"naive" controller that would result from applying a linear feedback law which would stabilize in the
absence of saturations. The objective of the work was to show that the calculations in the abstract proofs
can indeed be carried out explicitly (though this is an extremely simple case compared to the generality
of the results in our papers) and moreover, to study the performance of the resulting controller when
applied to the original nonlinear model. We found the
results to be encouraging and indicating the usefulness of further work along this direction. In one of
the cases simulated, the "naive" design would stabilize as well, though the difference in performance
is quite striking. In another case, the "naive" design results in instability, while the one given with our
method stabilizes. The overall performance of the system was not satisfactory for practical applications
(and in any case, the saturation levels were far too small to be of interest except perhaps in failure modes),
but of course this was just a first, simple, effort; much more work needs to be done.


PAC  Learning  and  VC  dimension 


We now elaborate some more on the topic, already brought up in the Overview part, of learning-theoretic
issues. We first review the technical framework. The definitions are as follows (they are standard, but the
terminology is changed a bit). An input space $\mathcal{U}$ as well as a collection $\mathcal{F}$ of maps $\mathcal{U} \to \{0,1\}$ are given.
The set $\mathcal{U}$ is assumed to be countable, or a Euclidean space, and the maps in $\mathcal{F}$ are assumed to be Borel
measurable. In addition, mild regularity assumptions are made which ensure that all sets appearing below
are measurable (details are omitted). Let $W$ be the set of all sequences $w = (u_1, f(u_1)),\ldots,(u_s, f(u_s))$
over all $s \ge 1$, $(u_1,\ldots,u_s) \in \mathcal{U}^s$, and $f \in \mathcal{F}$. An identifier is a map $\varphi : W \to \mathcal{F}$. The value of $\varphi$ on a
sequence $w$ such as the above is denoted as $\varphi_w$. The error of $\varphi$ with respect to a probability measure $P$
on $\mathcal{U}$, an $f \in \mathcal{F}$, and a sequence $(u_1,\ldots,u_s) \in \mathcal{U}^s$, is $\mathrm{Err}_\varphi(P, f, u_1,\ldots,u_s) := \mathrm{Prob}\,[\varphi_w(u) \neq f(u)]$,
where the probability is being understood with respect to $P$. (This is just the $L^1$ distance between two
binary functions.) The class $\mathcal{F}$ is (uniformly) learnable if there is some identifier $\varphi$ with the following
property: For each $\varepsilon, \delta > 0$ there is some $s$ so that, for every probability $P$ and every $f \in \mathcal{F}$,

$$\mathrm{Prob}\,[\mathrm{Err}_\varphi(P, f, u_1,\ldots,u_s) > \varepsilon] < \delta$$

(where the probability is being understood with respect to $P^s$ on $\mathcal{U}^s$). In the learnable case, the
function $s(\varepsilon, \delta)$ which provides, for any given $\varepsilon$ and $\delta$, the smallest possible $s$ as above, is called the
sample complexity of the class $\mathcal{F}$. If learnability holds then $s(\varepsilon, \delta)$ is automatically bounded by a
polynomial in $1/\varepsilon$ and $1/\delta$, in fact by $-(c/\varepsilon)\log(\delta\varepsilon)$, where $c$ is a constant that depends only on the
class $\mathcal{F}$. If there is any identifier at all in the above sense, then one can always use the following naive
identification procedure: just pick any element $f$ which is consistent with the observed data. This
leads computationally to the loading question discussed earlier. In the statistics literature this "naive
technique" is a particular case of what is called empirical risk minimization.
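
The naive identification procedure is easy to spell out for a very simple class. The sketch below uses threshold functions $f_\theta(u) = 1$ iff $u \ge \theta$ on $[0,1]$ with $P$ uniform, a hypothetical setting chosen only because a consistent hypothesis and its error can be computed in closed form. The target $\theta^* = 0.3$, the sample size, the seed, and the function names are assumptions.

```python
import random

def consistent_threshold(sample):
    """Return theta so that f_theta(u) = [u >= theta] agrees with every labeled point."""
    ones = [u for u, y in sample if y == 1]
    theta = min(ones) if ones else float("inf")
    # If some 0-labeled point lies at or above theta, no threshold function fits.
    if any(u >= theta for u, y in sample if y == 0):
        raise ValueError("sample is not loadable into the threshold class")
    return theta

rng = random.Random(0)            # fixed seed so the run is repeatable
theta_star = 0.3                  # hypothetical target concept
inputs = [rng.random() for _ in range(200)]
sample = [(u, int(u >= theta_star)) for u in inputs]

theta_hat = consistent_threshold(sample)

# For P uniform on [0, 1], the two classifiers disagree exactly on
# [theta_star, theta_hat), so the error is the length of that interval.
err = theta_hat - theta_star
```

Any hypothesis consistent with the data would do. Picking the smallest positively labeled input is just one convenient choice, and the resulting error shrinks as the sample grows.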

Checking learnability is in principle a difficult issue, and the introduction of the following combinatorial
concept is extremely useful in that regard. A subset $S = \{u_1,\ldots,u_s\}$ of $\mathcal{U}$ is said to be shattered
by the class of binary functions $\mathcal{F}$ if all possible binary labeled samples $\{(u_1,y_1),\ldots,(u_s,y_s)\}$ (all
$y_i \in \{0,1\}$) are loadable into $\mathcal{F}$. The Vapnik-Chervonenkis dimension $\mathrm{VC}(\mathcal{F})$ is the supremum (possibly
infinite) of the set of integers $\kappa$ for which there is some set $S$ of cardinality $\kappa$ that can be shattered
by $\mathcal{F}$. (Thus, $\mathrm{VC}(\mathcal{F})$ is at least as large as the capacity $c(\mathcal{F})$ defined earlier.) The main result, due
to Blumer, Ehrenfeucht, Haussler, and Warmuth, but closely related to previous results by Vapnik and
others, is: The class $\mathcal{F}$ is learnable if and only if $\mathrm{VC}(\mathcal{F}) < \infty$. So the VC dimension completely
characterizes learnability with respect to unknown distributions; in fact, the constant "c" in the sample
complexity bound given earlier is a simple function of the number $\mathrm{VC}(\mathcal{F})$, independent of $\mathcal{F}$ itself.
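
Shattering can be checked by brute force for small finite examples. The following sketch does so for two toy classes restricted to a six-point domain; the domain, the classes, and the function names are illustrative assumptions, and the computed number is the VC dimension of the class restricted to that finite domain only.

```python
from itertools import combinations

def is_shattered(points, classifiers):
    """True if every 0/1 labeling of `points` is realized by some classifier."""
    realized = {tuple(f(u) for u in points) for f in classifiers}
    return len(realized) == 2 ** len(points)

def vc_dimension(domain, classifiers):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(S, classifiers) for S in combinations(domain, k)):
            d = k
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return d

domain = [0, 1, 2, 3, 4, 5]
cuts = [c - 0.5 for c in range(7)]          # -0.5, 0.5, ..., 5.5

# Threshold functions u -> [u >= t].
thresholds = [lambda u, t=t: int(u >= t) for t in cuts]

# Indicator functions of intervals [a, b]; a == b gives the empty set on the domain.
intervals = [lambda u, a=a, b=b: int(a <= u <= b)
             for a in cuts for b in cuts if a <= b]

vc_thresholds = vc_dimension(domain, thresholds)
vc_intervals = vc_dimension(domain, intervals)
```

Thresholds cannot realize the labeling (1, 0) on two ordered points and intervals cannot realize (1, 0, 1) on three, which is why the search stops at 1 and 2 respectively.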


In neural net classification problems, where the sign of the output of a net is used to decide
membership in a set, one is interested in classes of the type $\mathcal{F} = \mathcal{H}(\mathcal{F}_{n,\sigma,m})$; these are the classifiers
implementable by 1HL nets with $n$ hidden units and $m$ inputs ($n$ and $m$ fixed) and activation
$\sigma$. The best-known result is a by now classical one due to Baum and Haussler, which asserts that
$\mathrm{VC}(\mathcal{F}) \le d\,mn\,(1 + \log(n))$, for the choice $\sigma = \mathcal{H}$, where $d$ is a small constant. Thus 1HL nets with
threshold activations are learnable in the above sense. In practice, however, one uses differentiable $\sigma$.
There exist sigmoids, however, for which
the VC dimension of the class $\mathcal{H}(\mathcal{F}_{n,\sigma,m})$ is infinite; see [3]. For some time, the question of finiteness
of $\mathrm{VC}(\mathcal{F})$ for $\mathcal{F} = \mathcal{H}(\mathcal{F}_{n,\sigma,m})$ and $\sigma = \sigma_s$, that is, for
the standard architectures, was open. For bounded weights there were results by White, Haussler and
others, but VC bounds depend on the assumed bounds on weights. The question was recently settled
theoretically in [33], which showed that indeed $\mathrm{VC}(\mathcal{F}) < \infty$, for activation $\sigma_s$. Thus sigmoidal nets
appear to have some special properties, vis a vis other possible more general parametric classes of
functions, at least from a learnability viewpoint. The paper [33] provided several other results as well,
including finiteness of the Pollard-Haussler pseudodimension, which gives also learnability of real-valued
as opposed to merely binary functions, and results on "teaching dimension" related to a different
paradigm of learning introduced by Goldman and Kearns. The proof of the VC dimension result in [33] hinged
on recent techniques from model theory, combining recent results of Wilkie on order-minimality with
work by Laskowski.


Recurrent  Nets:  Identifiability,  Observability 

We now turn to recurrent nets, in discrete or in continuous time. The dynamics
and observation map of such systems can be described by a discrete or continuous time system in the
usual sense of control theory, with the special form of the difference or differential equations as in
Equation (4), that is, $x^+$ (or $\dot x$) $= \vec{\sigma}_n(Ax + Bu)$, $y = Cx$. As before, we use the superscripts "+"
and "$\cdot$" to denote time-shift and time-derivative respectively, and we omit the time arguments $t$. For
the continuous-time case, we assume that the function $\sigma$ is globally Lipschitz, so that the existence and
uniqueness of solutions for the differential equation is guaranteed for all times.
A system of the type in Equation (4) is specified by the data $\Sigma := (A, B, C, \sigma)$; as in the feedforward
case, we again omit $\sigma$ if obvious from the context. (One could add constant terms and consider more
general systems $x^+$ (or $\dot x$) $= \vec{\sigma}_n(Ax + Bu + b)$, $y = Cx + c_0$. Such more general systems can be reduced
to the above case, by adding a state which has a constant value. Note however that this transformation
means that one cannot take the initial state as zero while keeping the above form, and this in fact gives
rise to highly nontrivial complications, discussed later.)

As mentioned earlier, the PIs and graduate students have done substantial work on questions of
parameter identifiability (the possibility of recovering the entries of the matrices $A$, $B$, and $C$ from
the input/output map $u(\cdot) \mapsto y(\cdot)$ of the system) as well as other related questions such as state-
observability. The issue is intimately related with the material discussed earlier on weight determination
for feedforward nets, but mathematically the topics are quite different. In particular, some of the technical
restrictions regarding $\sigma$ can be relaxed, because in this context it is often the case that the initial state is
an equilibrium state, and this means that property WIP is closer to the problem than property IP. On the
other hand, the problem becomes harder because of the dynamics, especially in continuous-time, and
tools from nonlinear systems must be employed.

We next provide some more details. Depending on the interpretation (discrete or continuous time),
one defines an appropriate behavior $\mathrm{beh}_\Sigma$, mapping suitable spaces of input functions into spaces of
output functions, again in the standard sense of control theory, for any fixed initial state. For instance, in
continuous time, one proceeds as follows: For any measurable essentially bounded $u(\cdot) : [0,T] \to \mathbb{R}^m$,
denote by $\phi(t, \xi, u)$ the solution at time $t$ with initial state $x(0) = \xi$; this is defined on $[0,T]$ because of
the global Lipschitz assumption. For each input, let $\mathrm{beh}_\Sigma(u)$ be the output function corresponding to the
initial state $x(0) = 0$, that is, $\mathrm{beh}_\Sigma(u)(t) := C(\phi(t, 0, u))$. Two recurrent nets $\Sigma$ and $\hat\Sigma$ (necessarily with
the same numbers of input and output channels, i.e. with $\hat p = p$ and $\hat m = m$) are equivalent, denoted
$\Sigma \sim \hat\Sigma$ (in discrete or continuous time, depending on the context), if it holds that $\mathrm{beh}_\Sigma = \mathrm{beh}_{\hat\Sigma}$.

Assume from now on that $\sigma$ is infinitely differentiable around zero, and that it satisfies the
following extremely mild nonlinearity assumption:

$$\sigma(0) = 0, \quad \sigma'(0) \ne 0, \quad \sigma''(0) = 0, \quad \sigma^{(q)}(0) \ne 0 \ \text{ for some } q > 2. \qquad (*)$$


(Far less is needed for the results to be quoted, but this is certainly sufficient. For odd analytic
functions, we are just asking that $\sigma$ be nonlinear and nonsingular at zero.) Let $S(n, m, p)$ denote the
set of all recurrent nets $\Sigma(A, B, C, \sigma)$ with fixed $n$, $m$, $p$ and any fixed $\sigma$ as above. Two
nets $\Sigma$ and $\tilde\Sigma$ in $S(n, m, p)$ are sign-permutation equivalent if there exists a nonsingular
matrix $T$ such that

$$T^{-1}\tilde{A}T = A, \quad T^{-1}\tilde{B} = B, \quad \tilde{C}T = C,$$

and $T$ has the special form $T = PD$, where $P$ is a permutation matrix and
$D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, with each $\lambda_i = \pm 1$. The nets $\Sigma$ and
$\tilde\Sigma$ are just permutation equivalent if the above holds with $D = I$, that is, $T$ is a
permutation matrix.

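Let $\mathcal{B}^{n,m}$ be the class of $n \times m$ real matrices $B$ for which: $b_{i,j} \ne 0$ for all
$i, j$, and for each $i \ne j$, there exists some $k$ such that $|b_{i,k}| \ne |b_{j,k}|$. For any choice of
positive integers $n, m, p$, denote by $S^c_{n,m,p}$ the set of all triples of matrices $(A, B, C)$,
$A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{p \times n}$, which are
canonical (observable and controllable). This is a generic set of triples, in the sense that the entries of
those which do not satisfy the property form a proper Zariski-closed subset of $\mathbb{R}^{n^2+nm+np}$.
Finally, let:

$$\tilde{S}(n,m,p) = \{\, \Sigma(A,B,C,\sigma) \mid B \in \mathcal{B}^{n,m} \text{ and } (A,B,C) \in S^c_{n,m,p} \,\}.$$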


Then,  in  [7],  a  general  result  was  proved,  for  a  somewhat  larger  class  of  systems,  which  in  particular 
implies: 



Assume that $\sigma$ is odd and satisfies property (*). Then $\Sigma \sim \tilde\Sigma$ if and only if
$\Sigma$ and $\tilde\Sigma$ are sign-permutation equivalent.
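
The easy direction of this statement (sign-permutation equivalent nets have the same behavior) can be checked numerically. The sketch below is only an illustration under assumptions of ours: discrete time, $\sigma = \tanh$ (an odd activation), random matrices, and a hypothetical helper `behavior`. It does not touch the hard converse direction.

```python
import numpy as np

def behavior(A, B, C, inputs, sigma=np.tanh):
    # Discrete-time behavior from x(0) = 0: x+ = sigma(A x + B u), y = C x.
    x = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        x = sigma(A @ x + B @ u)
        outputs.append(C @ x)
    return np.array(outputs)

rng = np.random.default_rng(0)
n, m, p = 4, 2, 3
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
C = rng.normal(size=(p, n))

# T = P D with P a permutation matrix and D a diagonal matrix of signs.
P = np.eye(n)[rng.permutation(n)]
D = np.diag(rng.choice([-1.0, 1.0], size=n))
T = P @ D
T_inv = D @ P.T  # D is its own inverse and P is orthogonal

# Second net, defined so that T^{-1} A2 T = A, T^{-1} B2 = B, C2 T = C.
A2, B2, C2 = T @ A @ T_inv, T @ B, C @ T_inv

inputs = rng.normal(size=(20, m))
y1 = behavior(A, B, C, inputs)
y2 = behavior(A2, B2, C2, inputs)
```

The outputs `y1` and `y2` coincide because an odd $\sigma$ applied componentwise commutes with $T = PD$, so the state of the second net is $T$ times the state of the first at every step.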


(An analogous result can be proved when $\sigma$ is not odd, resulting in just permutation equivalence.)
The proof starts with the obvious idea of considering the parameters as extra constant states and computing
the observables for the extended systems; this involves considering iterated Lie derivatives of observations
under the vector fields defining the system and is a routine approach in nonlinear control. However,
and this is the nontrivial part, extracting enough information from these Lie derivatives is not entirely
easy, and the proof centers on that issue. The reference cited also included results for the more general
class of systems of the Hopfield type $\dot{x} = -Dx + \sigma(Ax + Bu)$, as well as results showing that
$\sigma$ itself can be identified from the black-box data; there is no room here to explain all these results.
Also, discrete-time results are now available; see [35]. Moreover, if the activation is analytic, the response
to a single long-enough input function is theoretically enough for identification of all parameters, just as
impulse responses are used in linear systems theory; this is also explained in [7].


A result on observability was also recently given, in [8]. We start by slightly enlarging the class
$\mathcal{B}^{n,m}$ to $\mathcal{B}_{n,m}$, the class of those $B \in \mathbb{R}^{n \times m}$ for which no row
vanishes and for which the second condition in the definition of $\mathcal{B}^{n,m}$ ($\forall i \ne j$,
$\exists k$ s.t. $|b_{i,k}| \ne |b_{j,k}|$) holds. Fix $n, m$. We denote by $\mathcal{S}$ the set of all
recurrent nets for which $B \in \mathcal{B}_{n,m}$ and $\sigma$ satisfies the property IP. Let $e_i$,
$i = 1, \ldots, n$, denote the canonical basis elements in $\mathbb{R}^n$. A subspace $V$ of the form
$V = \mathrm{span}\{e_{i_1}, \ldots, e_{i_l}\}$, for some $l > 0$ and some subset of the $e_i$'s, will be called
a coordinate subspace; that is, coordinate subspaces are those invariant under all the projections
$\pi_i : \mathbb{R}^n \to \mathbb{R}^n$, $\pi_i e_j = \delta_{ij} e_i$. Sums of coordinate subspaces are again
of that form. Thus, for each pair of matrices $(A, C)$ with $A \in \mathbb{R}^{n \times n}$ and
$C \in \mathbb{R}^{p \times n}$ there is a unique largest $A$-invariant coordinate subspace included in
$\ker C$; we denote this by $O^c(A, C)$. One way to compute $O^c = O^c(A, C)$ is by the following recursive
procedure:

$$O^c_0 := \ker C, \quad O^c_{k+1} := O^c_k \cap A^{-1}O^c_k \cap \pi_1^{-1}O^c_k \cap \cdots \cap \pi_n^{-1}O^c_k, \quad k = 0, \ldots, n-1, \quad O^c := O^c_n,$$

which can be implemented by an algorithm which employs a number of elementary algebraic operations which
is polynomial on the size of $n$ and $m$. (Also one could use graph-theoretic computations instead.)
The two main results in the above reference (one for discrete and another one for continuous time) are
summarized as:

$\Sigma \in \mathcal{S}$ is observable if and only if $\ker A \cap \ker C = O^c(A, C) = 0$.

Surprisingly, this is a very simple characterization, and easy to check. Since the corresponding linear
system (if $\sigma$ were the identity) is observable if and only if $O(A, C)$, the largest $A$-invariant
subspace included in $\ker C$, is zero, and since both $O^c(A, C)$ and $\ker A \cap \ker C$ are subspaces of
$O(A, C)$, one gets in particular that if $\Sigma \in \mathcal{S}$ and the pair of matrices $(A, C)$ is
observable then $\Sigma$ is observable, but the converse does not hold.
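
To make the criterion concrete, here is a minimal sketch of how it might be checked numerically. It is not the algorithm of the cited reference: it uses an index-set (graph-style) variant of the computation, representing a coordinate subspace by the set of indices of its basis vectors. The function names and the tolerance `tol` are assumptions of ours, and the hypotheses on $B$ and $\sigma$ (membership in the class of nets considered above) are taken for granted rather than checked.

```python
import numpy as np

def largest_invariant_coordinate_subspace(A, C, tol=1e-12):
    # Returns the indices j such that e_j spans the largest A-invariant
    # coordinate subspace contained in ker C.
    n = A.shape[0]
    # e_j lies in ker C exactly when column j of C is zero.
    J = {j for j in range(n) if np.all(np.abs(C[:, j]) <= tol)}
    changed = True
    while changed:
        changed = False
        for j in sorted(J):
            # A e_j must stay in span{e_i : i in J}: no nonzero A[i, j] with i outside J.
            if any(abs(A[i, j]) > tol for i in range(n) if i not in J):
                J.remove(j)
                changed = True
    return sorted(J)

def satisfies_criterion(A, C, tol=1e-12):
    n = A.shape[0]
    # ker A intersect ker C = 0 iff the stacked matrix [A; C] has full column rank.
    kernels_trivial = np.linalg.matrix_rank(np.vstack([A, C])) == n
    return bool(kernels_trivial) and not largest_invariant_coordinate_subspace(A, C, tol)

def linear_pair_observable(A, C):
    # Classical rank test on the observability matrix [C; CA; ...; CA^(n-1)].
    n = A.shape[0]
    O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    return bool(np.linalg.matrix_rank(O) == n)

# The criterion holds although the linear pair (A, C) is not observable.
A1, C1 = np.eye(2), np.array([[1.0, -1.0]])
# The criterion fails: span{e_2, e_3} is A-invariant and contained in ker C.
A2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
C2 = np.array([[1.0, 0.0, 0.0]])
```

For the first pair, `satisfies_criterion(A1, C1)` holds while `linear_pair_observable(A1, C1)` does not, which is an instance of the failure of the converse noted above.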


Recurrent  Nets  as  Universal  Systems  Models,  and  an  Algorithmic  Approach 

Recurrent nets provide universal identification models, in a suitable sense. To be more precise, we
may consider continuous- or discrete-time, time-invariant systems $\Sigma$: $\dot{x}$ [or $x^+$] $= f(x, u)$,
$y = h(x)$ under standard smoothness assumptions (for instance, $x(t) \in \mathbb{R}^n$,
$u(t) \in \mathbb{R}^m$, and $y(t) \in \mathbb{R}^p$ for all $t$, and $f$ and $h$ are continuously
differentiable). It is not hard to prove that, on compacts and for finite time intervals, the behavior of
$\Sigma$ can be approximately simulated by the behavior of a recurrent network, assuming that one uses any
activation $\sigma$ which is universal, in the sense that dilates and translates of $\sigma$ are dense on
continuous functions with the compact-open topology. (If $\sigma$ is locally Riemann integrable, it is a
universal activation if and only if it is not a polynomial, as known from recent results by Leshno
and others.)
By approximate simulation we mean the following. In general, assume given two systems $\Sigma$ and
$\tilde\Sigma$ as above, where tildes denote data associated to the second system, and with the same number
of inputs and outputs (but possibly $\tilde{n} \ne n$). Suppose also given compact subsets
$K_1 \subseteq \mathbb{R}^n$ and $K_2 \subseteq \mathbb{R}^m$, as well as an $\varepsilon > 0$ and a $T > 0$.
Suppose further (this simplifies definitions, but can be relaxed) that for each initial state $x_0 \in K_1$
and each measurable control $u(\cdot) : [0, T] \to K_2$ the solution $\phi(t, x_0, u)$ is defined for all
$t \in [0, T]$. The system $\tilde\Sigma$ simulates $\Sigma$ on the sets $K_1$, $K_2$ in time $T$ and up to
accuracy $\varepsilon$ if there exist two continuous mappings $\alpha : \mathbb{R}^{\tilde{n}} \to \mathbb{R}^n$
and $\beta : \mathbb{R}^n \to \mathbb{R}^{\tilde{n}}$ so that the following property holds: For each
$x_0 \in K_1$ and each $u(\cdot) : [0, T] \to K_2$, denote $x(t) := \phi(t, x_0, u)$ and
$\tilde{x}(t) := \tilde\phi(t, \beta(x_0), u)$; then this second function is defined for all $t \in [0, T]$, and

$$\|x(t) - \alpha(\tilde{x}(t))\| < \varepsilon, \quad \|h(x(t)) - \tilde{h}(\tilde{x}(t))\| < \varepsilon$$

for all such $t$. One may ask for more regularity properties of the maps $\alpha$ and $\beta$ as part of the
definition; in any case the maps constructed in the reference given below are at least differentiable. Assume
that $\sigma$ is a universal activation, in the sense defined earlier. Then, for each system $\Sigma$ and for
each $K_1$, $K_2$, $\varepsilon$, $T$ as above, there is a $\sigma$-system $\tilde\Sigma$ that simulates
$\Sigma$ on the sets $K_1$, $K_2$ in time $T$ and up to accuracy $\varepsilon$. The proof is not hard, and it
involves first simply using universality in order to approximate the right-hand side of the original
equation, and then introducing dynamics for the "hidden units" consistently with the equations. This second
part requires a little care; for details, see for instance [31]. (Some variations of this result were given
earlier and independently in work by Polycarpou and Ioannou, and by Matthews, under more restrictive
assumptions and with somewhat different definitions and conclusions.)

Thus, recurrent nets approximate a wide class of nonlinear plants. Note, however, that approximations
are only valid on compact subsets of the state space and for finite time, so that many interesting
dynamical characteristics are not reflected. Since recurrent nets are being used in many application
areas (recall the earlier discussion), this universality property is of some interest. This is analogous to
the role of bilinear systems, which had been proposed previously (work by Fliess and by the second
PI in the mid-1970s) as universal models. As with bilinear systems, it is obvious that if one imposes
extra stability assumptions ("fading memory" type) it will be possible to obtain global approximations,
but this is probably not very useful, as stability is often a goal of control rather than an assumption. In
any case, the results justify the study of recurrent nets at least as much as they justify the investment
in the study of bilinear systems. And the parameter identification result mentioned earlier seems to say
that they are, from an estimation viewpoint, more appealing. (Bilinear realization provides uniqueness
of internal parameters only up to $GL(n)$.)

The philosophy of systems modeling using this approach would be to postulate a model of this
form (using the fact that the true plant can be so approximated) and then to identify parameters under
this assumption. This is fairly much what is done in practice with linear systems; true systems are
nonlinear, but linear models often approximate well enough that the approach makes sense. (Of course,
in practice one needs to account for noise, and robust identification techniques are being developed, in
the linear case, to deal with the fact that the true plant is not in the class.) In any case, next we suggest a
possible approach to numerical identification of a recurrent model for a very special (but still universal)
activation. The idea is to use a Fourier-type expansion in the right-hand side of equations. We have
recently started looking at the use of $\sigma(x) = \exp(2x)$ as the activation function; let us for now call a
recurrent net with this activation function an exp-system. This is a natural function to use, as we can see
from its utilization in Fourier analysis. With the introduction of complex numbers and the use of the
exponential, we are able to reduce a complicated nonlinearity in the output to a product. The benefit
is that we may use certain input/output pairs to exactly identify the parameters $A, B, C, x_0$ for a given
exp-system in closed form, with no need to solve any nonlinear programming problems. To give the
flavor of the algorithm that is possible (assuming to start with that there is no noise in measurements),
suppose we have a system of known dimension $n$ which is modeled by an exp-system, but with unknown
values of $A, B, C, x_0$. We take the single-input case for simplicity ($m = 1$) and allow for complex entries
in $A, B, C, x_0$.

Assuming we may reset the system to the (mathematically unknown but presumably physically
known) initial state and apply any input to obtain the corresponding output, we can identify those
parameters using the following procedure. To compute the values of $B$, we apply inputs of length
one with integer values $0, 1, \ldots, 2n - 1$. The resulting $2n$ output values $C\sigma(Ax_0)$,
$C\sigma(Ax_0 + B)$, $\ldots$, $C\sigma(Ax_0 + (2n - 1)B)$ can be written as

$$\tilde{C}\mathbf{1}, \quad \tilde{C}\,\mathrm{diag}(\sigma(b_1), \ldots, \sigma(b_n))\,\mathbf{1}, \quad \ldots, \quad \tilde{C}\,\mathrm{diag}(\sigma(b_1), \ldots, \sigma(b_n))^{2n-1}\,\mathbf{1},$$

where $\mathbf{1}$ is the column vector $(1, \ldots, 1)'$. Here $\tilde{C}$ is some vector, in general different
from $C$ if $x_0 \ne 0$. These elements can be seen as the first $2n$ Markov parameters (or impulse response
parameters) for an $n$-dimensional linear system whose "A" matrix has eigenvalues exactly equal to
$\sigma(b_1), \ldots, \sigma(b_n)$. These eigenvalues can be obtained by applying linear realization
techniques to the sequence. This step of the algorithm is based on Hankel-matrix techniques that are
classical in the context of linear recurrences and their multivariable extensions developed in control
theory. The method appears in many other areas; for instance in coding theory for decoding BCH codes, and
in learning theory for sparse polynomial interpolation. In order to actually compute $b_1, \ldots, b_n$, one
would need to apply the "inverse" of $\sigma$; this introduces some complication since there are many complex
logarithms, but it can be taken care of by applying a different set of multiples of a different input value.
The next step in the identification procedure is to determine the vector $C$. For this, apply inputs of
length 2 with integer values ranging from $0$ to $n + 1$ for the first input and $0$ to $n - 1$ for the second.
This provides a Vandermonde system of equations which can then be solved (again there are complications
because of complex logs). Finally, $A$ and $x_0$ can be obtained. We omit more details here, as in any case
not all aspects are yet well understood. But in a just-completed Ph.D. dissertation, in work supported
under this project, our graduate student Koplon (cf. [43], [42]) has provided a theoretical framework
and programmed a quick MATLAB implementation of the above algorithm, and without any noise in
measurements it performs correctly, retrieving an unknown system. This is a major research area, of
which we have barely scratched the surface. Much is needed, from the use of least-squares techniques
and methods based on Hankel-norm approximation to deal with noise (or model mismatch), to the
development of efficient algorithms, to the study of algorithms that use random as opposed to chosen
data (in learning terms, online as opposed to query-based).
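
The first step of the procedure (recovering $\sigma(b_1), \ldots, \sigma(b_n)$ from $2n$ outputs) can be sketched as follows. This is only an illustration of the classical Hankel-matrix idea, not the MATLAB implementation mentioned above. It assumes noise-free data, real entries, a single output, the activation $\exp(2x)$ as written in the text (only the product rule $\sigma(a + b) = \sigma(a)\sigma(b)$ is used), and a small made-up system; all function and variable names are hypothetical.

```python
import numpy as np

def sigma(x):
    # Activation as written in the text; only sigma(a + b) = sigma(a) * sigma(b) is used.
    return np.exp(2.0 * x)

def one_step_output(A, B, C, x0, u):
    # Scalar output after a length-one input u, starting from the reset state x0.
    return C @ sigma(A @ x0 + B * u)

def recover_sigma_b(outputs, n):
    # outputs[k] = sum_j c_j * z_j**k for k = 0..2n-1, with z_j = sigma(b_j).
    h = np.asarray(outputs)
    H0 = np.array([[h[i + j] for j in range(n)] for i in range(n)])
    H1 = np.array([[h[i + j + 1] for j in range(n)] for i in range(n)])
    # inv(H0) @ H1 is similar to diag(z_1, ..., z_n), so its eigenvalues are the z_j.
    return np.linalg.eigvals(np.linalg.solve(H0, H1))

# Hypothetical "unknown" system with n = 3, single input and single output.
n = 3
A = np.array([[0.1, 0.2, 0.0], [0.0, -0.1, 0.3], [0.2, 0.0, 0.1]])
B = np.array([0.1, -0.2, 0.3])
C = np.array([1.0, 2.0, -1.5])
x0 = np.array([1.0, -1.0, 0.5])

# Apply the integer inputs 0, 1, ..., 2n - 1, resetting to x0 each time.
outputs = [one_step_output(A, B, C, x0, k) for k in range(2 * n)]
z = np.sort(np.real(recover_sigma_b(outputs, n)))
b_recovered = np.log(z) / 2.0  # real case only: the logarithm is unambiguous here
```

The recovered values are determined only up to ordering, consistent with the permutation ambiguity discussed earlier. With complex entries, the last line would have to deal with the choice of logarithm branch mentioned in the text, and with noisy data the plain linear solve would need to be replaced by a least-squares or Hankel-norm based method.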